python run.py with data_root="/arrows_flickr30k" num_gpus=1 num_nodes=1 task_finetune_irtr_f30k_randaug per_gpu_batchsize=4 load_path="vilt_200k_mlm_itm.ckpt"

dandelin commented on July 21, 2024
python run.py with data_root="/arrows_flickr30k" num_gpus=1 num_nodes=1 task_finetune_irtr_f30k_randaug per_gpu_batchsize=4 load_path="vilt_200k_mlm_itm.ckpt"

from vilt.

Comments (15)

hongzhenwang commented on July 21, 2024

1. Make the Arrow files. The conversion scripts are located in vilt/utils/write_*.py. Run the make_arrow functions to convert the dataset to pyarrow binary files.
2. Set the following environment variables:
export MASTER_ADDR="0.0.0.0"
export MASTER_PORT="8000"
export NODE_RANK=0
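
For a single-machine run, the same three variables can also be filled in from Python before the trainer is created. The sketch below is only an illustration, assuming the values from the comment above are acceptable for one node; `apply_single_node_defaults` is a hypothetical helper and not part of ViLT or PyTorch Lightning.

```python
import os

# Values copied from the comment above. Assumption: on a single machine they
# only need to be non-empty and the port must be free.
SINGLE_NODE_DEFAULTS = {
    "MASTER_ADDR": "0.0.0.0",
    "MASTER_PORT": "8000",
    "NODE_RANK": "0",
}


def apply_single_node_defaults(environ=os.environ):
    """Fill in unset or empty DDP variables; leave real values untouched.

    Returns a dict of the variables that were actually changed.
    """
    applied = {}
    for key, value in SINGLE_NODE_DEFAULTS.items():
        if not environ.get(key):  # covers both "unset" and "set to ''"
            environ[key] = value
            applied[key] = value
    return applied
```

Calling this near the top of run.py, before `pl.Trainer(...)` is constructed, should have the same effect as the export lines, while values already exported in the shell are kept.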

dandelin commented on July 21, 2024

As the error reports ("Your PyTorch installation may be too old. (init at /pytorch/caffe2/serialize/inline_container.cc:132)"), it is highly likely that your PyTorch and PyTorch Lightning versions do not match ours.

Please install the latest PyTorch version.

jkkishore1999 commented on July 21, 2024

Can you please let me know the recommended PyTorch and PyTorch Lightning versions?
I have already completed these steps:
pip install -r requirements.txt
pip install -e .

The above error still occurs.

Do we need to modify requirements.txt?

dandelin commented on July 21, 2024

@jkkishore1999
PyTorch is not listed in requirements.txt, so you are using whichever PyTorch version you installed yourself.
PyTorch > 1.7 should work fine.
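
A quick way to compare an installed version against that threshold is sketched below. It is a minimal illustration: `parse_major_minor` and `meets_minimum` are hypothetical helpers, and treating 1.7 itself as acceptable is an assumption, since the comment only says "> 1.7".

```python
import re


def parse_major_minor(version):
    """Extract (major, minor) from strings such as '1.8.0+cu111'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version, minimum=(1, 7)):
    """Compare numerically, so that '1.10' ranks above '1.7'."""
    return parse_major_minor(version) >= minimum
```

With PyTorch installed, `meets_minimum(torch.__version__)` gives the answer for the current environment. Comparing tuples of integers avoids the string-comparison trap where "1.10" sorts before "1.7".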

jkkishore1999 commented on July 21, 2024

My PyTorch version is 1.8, but I still get other errors:

python run.py with data_root=/data2/dsets/dataset num_gpus=1 num_nodes=1 task_finetune_irtr_f30k_randaug per_gpu_batchsize=4 load_path="weights/vilt_200k_mlm_itm.ckpt"

WARNING - root - Changed type of config entry "max_steps" from int to NoneType
WARNING - ViLT - No observers have been added to this run
INFO - ViLT - Running command 'main'
INFO - ViLT - Started
Global seed set to 0
INFO - lightning - Global seed set to 0
GPU available: True, used: True
INFO - lightning - GPU available: True, used: True
TPU available: None, using: 0 TPU cores
INFO - lightning - TPU available: None, using: 0 TPU cores
Using environment variable NODE_RANK for node rank ().
INFO - lightning - Using environment variable NODE_RANK for node rank ().
ERROR - ViLT - Failed after 0:00:05!
Traceback (most recent call last):
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/experiment.py", line 312, in run_commandline
return self.run(
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/experiment.py", line 276, in run
run()
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/run.py", line 238, in __call__
self.result = self.main_function(*args)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/config/captured_function.py", line 42, in captured_function
result = wrapped(*args, **kwargs)
File "run.py", line 48, in main
trainer = pl.Trainer(
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 41, in overwrite_by_env_vars
return fn(self, **kwargs)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 347, in __init__
self.accelerator_connector.on_trainer_init(
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator_connector.py", line 127, in on_trainer_init
self.trainer.node_rank = self.determine_ddp_node_rank()
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator_connector.py", line 415, in determine_ddp_node_rank
return int(rank)
ValueError: invalid literal for int() with base 10: ''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "run.py", line 11, in <module>
def main(_config):
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/experiment.py", line 190, in automain
self.run_commandline()
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/experiment.py", line 347, in run_commandline
print_filtered_stacktrace()
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/utils.py", line 493, in print_filtered_stacktrace
print(format_filtered_stacktrace(filter_traceback), file=sys.stderr)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/utils.py", line 528, in format_filtered_stacktrace
return "".join(filtered_traceback_format(tb_exception))
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/utils.py", line 568, in filtered_traceback_format
current_tb = tb_exception.exc_traceback
AttributeError: 'TracebackException' object has no attribute 'exc_traceback'

Can you please help?

jkkishore1999 commented on July 21, 2024

Also, the same command sometimes fails with a different error:

WARNING - root - Changed type of config entry "max_steps" from int to NoneType
WARNING - ViLT - No observers have been added to this run
INFO - ViLT - Running command 'main'
INFO - ViLT - Started
Global seed set to 0
INFO - lightning - Global seed set to 0
GPU available: True, used: True
INFO - lightning - GPU available: True, used: True
TPU available: None, using: 0 TPU cores
INFO - lightning - TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
INFO - lightning - LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Using native 16bit precision.
INFO - lightning - Using native 16bit precision.
Missing logger folder: result/finetune_irtr_f30k_randaug_seed0_from_vilt_200k_mlm_itm
WARNING - lightning - Missing logger folder: result/finetune_irtr_f30k_randaug_seed0_from_vilt_200k_mlm_itm
Global seed set to 0
INFO - lightning - Global seed set to 0
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
INFO - lightning - initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
INFO - root - Added key: store_based_barrier_key:1 to store for rank: 0
ERROR - ViLT - Failed after 0:00:05!
Traceback (most recent call last):
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/experiment.py", line 312, in run_commandline
return self.run(
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/experiment.py", line 276, in run
run()
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/run.py", line 238, in __call__
self.result = self.main_function(*args)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/config/captured_function.py", line 42, in captured_function
result = wrapped(*args, **kwargs)
File "run.py", line 71, in main
trainer.fit(model, datamodule=dm)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 473, in fit
results = self.accelerator_backend.train()
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 152, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 268, in ddp_train
self.trainer.call_setup_hook(model)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 859, in call_setup_hook
self.datamodule.setup(stage_name)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/core/datamodule.py", line 92, in wrapped_fn
return fn(*args, **kwargs)
File "/others/cs16b114/ViLT/vilt/datamodules/multitask_datamodule.py", line 34, in setup
dm.setup(stage)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/core/datamodule.py", line 92, in wrapped_fn
return fn(*args, **kwargs)
File "/others/cs16b114/ViLT/vilt/datamodules/datamodule_base.py", line 137, in setup
self.set_train_dataset()
File "/others/cs16b114/ViLT/vilt/datamodules/datamodule_base.py", line 76, in set_train_dataset
self.train_dataset = self.dataset_cls(
File "/others/cs16b114/ViLT/vilt/datasets/f30k_caption_karpathy_dataset.py", line 15, in __init__
super().__init__(*args, **kwargs, names=names, text_column_name="caption")
File "/others/cs16b114/ViLT/vilt/datasets/base_dataset.py", line 53, in __init__
self.table_names += [name] * len(tables[i])
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "run.py", line 11, in <module>
def main(_config):
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/experiment.py", line 190, in automain
self.run_commandline()
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/experiment.py", line 347, in run_commandline
print_filtered_stacktrace()
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/utils.py", line 493, in print_filtered_stacktrace
print(format_filtered_stacktrace(filter_traceback), file=sys.stderr)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/utils.py", line 528, in format_filtered_stacktrace
return "".join(filtered_traceback_format(tb_exception))
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/utils.py", line 568, in filtered_traceback_format
current_tb = tb_exception.exc_traceback
AttributeError: 'TracebackException' object has no attribute 'exc_traceback'

dandelin commented on July 21, 2024
  1. Please check that you have set the following environment variables:
    export MASTER_ADDR=$DIST_0_IP
    export MASTER_PORT=$DIST_0_PORT
    export NODE_RANK=$DIST_RANK
  2. Is your data located in data_root=/data2/dsets/dataset or data_root=/arrows_flickr30k?

jkkishore1999 commented on July 21, 2024

Thanks for the help.

  1. Even after running the three export commands above, the three environment variables are still empty:
    declare -x MASTER_ADDR=""
    declare -x MASTER_PORT=""
    declare -x NODE_RANK=""
  2. I have changed my data_root to /data2/dsets/dataset.

Can you please help?

dandelin commented on July 21, 2024
packages/pytorch_lightning/accelerators/accelerator_connector.py", line 415, in determine_ddp_node_rank
return int(rank)
ValueError: invalid literal for int() with base 10: ''

The error above seems to occur because DDP was not initialized properly.

Do you mean that the export commands did not change the environment variables?
These environment variables must be set for PyTorch Lightning to run DDP training properly.
Please make sure they are set (you can check the current environment variables with the env command).
Also, check out this guide.
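
The check described above can also be done from Python before launching training. The following is a minimal sketch, not something shipped with ViLT or PyTorch Lightning: `find_ddp_env_problems` is a hypothetical helper that reports variables that are unset, empty, or not integers where an integer is expected. An empty NODE_RANK is what makes `int(rank)` fail in the traceback.

```python
import os

REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "NODE_RANK")
INTEGER_VARS = ("MASTER_PORT", "NODE_RANK")


def find_ddp_env_problems(environ=os.environ):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key in REQUIRED_VARS:
        value = environ.get(key)
        if value is None:
            problems.append(f"{key} is not set")
        elif value == "":
            # e.g. `export NODE_RANK=$DIST_RANK` when DIST_RANK is undefined
            problems.append(f"{key} is set but empty")
    for key in INTEGER_VARS:
        value = environ.get(key)
        if value:
            try:
                int(value)
            except ValueError:
                problems.append(f"{key}={value!r} is not an integer")
    return problems
```

Printing the result of `find_ddp_env_problems()` in the same shell session that runs `python run.py ...` shows whether the exports actually took effect there.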

File "/others/cs16b114/ViLT/vilt/datasets/base_dataset.py", line 53, in __init__
self.table_names += [name] * len(tables[i])
IndexError: list index out of range

This error was probably raised because the list tables is empty.
Please check in advance that the dataset files exist at f"{data_dir}/{name}.arrow".
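
One way to run that check is sketched below. It assumes only the path pattern quoted above; `missing_arrow_files` is a hypothetical helper, and the split names in the usage note are guesses based on the dataset module name, so substitute the names your dataset class actually requests.

```python
import os


def missing_arrow_files(data_dir, names):
    """Return the expected Arrow paths under data_dir that do not exist."""
    expected = [os.path.join(data_dir, f"{name}.arrow") for name in names]
    return [path for path in expected if not os.path.isfile(path)]
```

For example, `missing_arrow_files("/data2/dsets/dataset", ["f30k_caption_karpathy_train"])` (a hypothetical name) returns the paths that are absent. A non-empty result would be consistent with tables being empty and the IndexError above.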

Miazzzzx commented on July 21, 2024

Have you solved your problem? I've encountered the same problem. If possible, could you please tell me how to solve it?

631212502 commented on July 21, 2024

Have you solved it? I have the same bug. I have checked the data location and the environment variables, but print(tables) always comes back empty. Maybe I am setting the path incorrectly. Could you tell me where the data should be placed so that data_root=/data2/dsets/dataset works directly?

631212502 commented on July 21, 2024

I have found the reason: the path was missing the quotation marks ("").

ThompsonISAT commented on July 21, 2024

Hi, even after adding the quotation marks ("") around the path, I still get the same error. Could you help me fix it? Thank you so much!

XX1nn commented on July 21, 2024

@jkkishore1999 I noticed that you are using num_gpus=1 num_nodes=1. I use the same parameters as you, and I now get the same error:

File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/core/datamodule.py", line 92, in wrapped_fn
return fn(*args, **kwargs)
File "/others/cs16b114/ViLT/vilt/datamodules/datamodule_base.py", line 137, in setup
self.set_train_dataset()
File "/others/cs16b114/ViLT/vilt/datamodules/datamodule_base.py", line 76, in set_train_dataset
self.train_dataset = self.dataset_cls(
File "/others/cs16b114/ViLT/vilt/datasets/f30k_caption_karpathy_dataset.py", line 15, in __init__
super().__init__(*args, **kwargs, names=names, text_column_name="caption")
File "/others/cs16b114/ViLT/vilt/datasets/base_dataset.py", line 53, in __init__
self.table_names += [name] * len(tables[i])
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "run.py", line 11, in <module>
def main(_config):
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/experiment.py", line 190, in automain
self.run_commandline()
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/experiment.py", line 347, in run_commandline
print_filtered_stacktrace()
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/utils.py", line 493, in print_filtered_stacktrace
print(format_filtered_stacktrace(filter_traceback), file=sys.stderr)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/utils.py", line 528, in format_filtered_stacktrace
return "".join(filtered_traceback_format(tb_exception))
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/utils.py", line 568, in filtered_traceback_format
current_tb = tb_exception.exc_traceback
AttributeError: 'TracebackException' object has no attribute 'exc_traceback'

May I know how this was resolved in the end? Should I first set the environment variables and then check whether the files exist? However, I am using non-distributed training, so how do I choose the values of the environment variables (MASTER_ADDR="" MASTER_PORT="" NODE_RANK="")?

XX1nn commented on July 21, 2024

@hongzhenwang
Thanks for your answer. Are the values you mentioned meant for non-distributed runs? Does 0.0.0.0 mean that any IP is accepted? Can I just use "0.0.0.0" and NODE_RANK=0 for my non-distributed fine-tuning?
