Comments (15)
1. Make the Arrow file. Conversion scripts are located in vilt/utils/write_*.py. Run the make_arrow functions to convert the dataset to a pyarrow binary file.
2. export MASTER_ADDR="0.0.0.0"
export MASTER_PORT="8000"
export NODE_RANK=0
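For anyone who prefers to set these from Python rather than the shell, here is a minimal sketch. It assumes a single machine (one node) and reuses the values from the export lines above. The helper name apply_single_node_defaults is hypothetical and not part of ViLT.

```python
import os

# Values copied from the export lines above; they assume one machine (one node).
SINGLE_NODE_DEFAULTS = {
    "MASTER_ADDR": "0.0.0.0",
    "MASTER_PORT": "8000",
    "NODE_RANK": "0",
}


def apply_single_node_defaults(env=os.environ):
    """Fill in DDP variables that are unset or empty; keep values that are already set.

    Hypothetical helper for illustration, not part of the ViLT code base.
    """
    for key, value in SINGLE_NODE_DEFAULTS.items():
        if not env.get(key):  # covers both "missing" and "empty string"
            env[key] = value
    return {key: env[key] for key in SINGLE_NODE_DEFAULTS}
```

This would need to run in the same process before the trainer is created, since changes to os.environ are only visible to the current process and its children.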
from vilt.
As the error reports ("Your PyTorch installation may be too old. (init at /pytorch/caffe2/serialize/inline_container.cc:132)"), it is highly likely that your PyTorch and PyTorch Lightning versions do not match ours.
Please install the latest PyTorch version.
from vilt.
Can you please let me know the recommended PyTorch and PyTorch Lightning versions?
I have already completed this step:
pip install -r requirements.txt
pip install -e .
but the above error still occurred.
Do we need to modify requirements.txt?
from vilt.
@jkkishore1999
PyTorch is not in requirements.txt, so you are using your own installed version of PyTorch.
PyTorch > 1.7 should work fine.
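To see which versions are actually installed in the active environment, something like the following sketch may help. It uses only the standard library (Python 3.8+). The distribution names "torch" and "pytorch-lightning" are the usual pip names, and the helper names are hypothetical.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a pip distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def major_minor(version):
    """Turn a string such as '1.8.0+cu111' into (1, 8) for a coarse comparison.

    Assumes the version has at least a numeric major and minor part.
    """
    parts = version.split("+")[0].split(".")
    return int(parts[0]), int(parts[1])


if __name__ == "__main__":
    for name in ("torch", "pytorch-lightning"):
        print(name, installed_version(name))
```

The tuple from major_minor can then be compared against (1, 7) to check the "PyTorch > 1.7" suggestion above.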
from vilt.
My PyTorch version is 1.8, but I still get other errors:
python run.py with data_root=/data2/dsets/dataset num_gpus=1 num_nodes=1 task_finetune_irtr_f30k_randaug per_gpu_batchsize=4 load_path="weights/vilt_200k_mlm_itm.ckpt"
WARNING - root - Changed type of config entry "max_steps" from int to NoneType
WARNING - ViLT - No observers have been added to this run
INFO - ViLT - Running command 'main'
INFO - ViLT - Started
Global seed set to 0
INFO - lightning - Global seed set to 0
GPU available: True, used: True
INFO - lightning - GPU available: True, used: True
TPU available: None, using: 0 TPU cores
INFO - lightning - TPU available: None, using: 0 TPU cores
Using environment variable NODE_RANK for node rank ().
INFO - lightning - Using environment variable NODE_RANK for node rank ().
ERROR - ViLT - Failed after 0:00:05!
Traceback (most recent call last):
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/experiment.py", line 312, in run_commandline
return self.run(
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/experiment.py", line 276, in run
run()
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/run.py", line 238, in call
self.result = self.main_function(*args)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/config/captured_function.py", line 42, in captured_function
result = wrapped(*args, **kwargs)
File "run.py", line 48, in main
trainer = pl.Trainer(
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 41, in overwrite_by_env_vars
return fn(self, **kwargs)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 347, in init
self.accelerator_connector.on_trainer_init(
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator_connector.py", line 127, in on_trainer_init
self.trainer.node_rank = self.determine_ddp_node_rank()
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator_connector.py", line 415, in determine_ddp_node_rank
return int(rank)
ValueError: invalid literal for int() with base 10: ''
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "run.py", line 11, in
def main(_config):
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/experiment.py", line 190, in automain
self.run_commandline()
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/experiment.py", line 347, in run_commandline
print_filtered_stacktrace()
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/utils.py", line 493, in print_filtered_stacktrace
print(format_filtered_stacktrace(filter_traceback), file=sys.stderr)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/utils.py", line 528, in format_filtered_stacktrace
return "".join(filtered_traceback_format(tb_exception))
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/utils.py", line 568, in filtered_traceback_format
current_tb = tb_exception.exc_traceback
AttributeError: 'TracebackException' object has no attribute 'exc_traceback'
Can you please help?
from vilt.
Also, the same command sometimes produces a different error:
WARNING - root - Changed type of config entry "max_steps" from int to NoneType
WARNING - ViLT - No observers have been added to this run
INFO - ViLT - Running command 'main'
INFO - ViLT - Started
Global seed set to 0
INFO - lightning - Global seed set to 0
GPU available: True, used: True
INFO - lightning - GPU available: True, used: True
TPU available: None, using: 0 TPU cores
INFO - lightning - TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
INFO - lightning - LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Using native 16bit precision.
INFO - lightning - Using native 16bit precision.
Missing logger folder: result/finetune_irtr_f30k_randaug_seed0_from_vilt_200k_mlm_itm
WARNING - lightning - Missing logger folder: result/finetune_irtr_f30k_randaug_seed0_from_vilt_200k_mlm_itm
Global seed set to 0
INFO - lightning - Global seed set to 0
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
INFO - lightning - initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
INFO - root - Added key: store_based_barrier_key:1 to store for rank: 0
ERROR - ViLT - Failed after 0:00:05!
Traceback (most recent call last):
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/experiment.py", line 312, in run_commandline
return self.run(
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/experiment.py", line 276, in run
run()
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/run.py", line 238, in call
self.result = self.main_function(*args)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/config/captured_function.py", line 42, in captured_function
result = wrapped(*args, **kwargs)
File "run.py", line 71, in main
trainer.fit(model, datamodule=dm)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 473, in fit
results = self.accelerator_backend.train()
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 152, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 268, in ddp_train
self.trainer.call_setup_hook(model)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 859, in call_setup_hook
self.datamodule.setup(stage_name)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/core/datamodule.py", line 92, in wrapped_fn
return fn(*args, **kwargs)
File "/others/cs16b114/ViLT/vilt/datamodules/multitask_datamodule.py", line 34, in setup
dm.setup(stage)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/core/datamodule.py", line 92, in wrapped_fn
return fn(*args, **kwargs)
File "/others/cs16b114/ViLT/vilt/datamodules/datamodule_base.py", line 137, in setup
self.set_train_dataset()
File "/others/cs16b114/ViLT/vilt/datamodules/datamodule_base.py", line 76, in set_train_dataset
self.train_dataset = self.dataset_cls(
File "/others/cs16b114/ViLT/vilt/datasets/f30k_caption_karpathy_dataset.py", line 15, in init
super().init(*args, **kwargs, names=names, text_column_name="caption")
File "/others/cs16b114/ViLT/vilt/datasets/base_dataset.py", line 53, in init
self.table_names += [name] * len(tables[i])
IndexError: list index out of range
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "run.py", line 11, in
def main(_config):
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/experiment.py", line 190, in automain
self.run_commandline()
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/experiment.py", line 347, in run_commandline
print_filtered_stacktrace()
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/utils.py", line 493, in print_filtered_stacktrace
print(format_filtered_stacktrace(filter_traceback), file=sys.stderr)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/utils.py", line 528, in format_filtered_stacktrace
return "".join(filtered_traceback_format(tb_exception))
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/utils.py", line 568, in filtered_traceback_format
current_tb = tb_exception.exc_traceback
AttributeError: 'TracebackException' object has no attribute 'exc_traceback'
from vilt.
- export MASTER_ADDR=$DIST_0_IP
export MASTER_PORT=$DIST_0_PORT
export NODE_RANK=$DIST_RANK
Please check that you've set the above environment variables.
- Is your data located in data_root=/data2/dsets/dataset or data_root=/arrows_flickr30k?
from vilt.
Thanks for the help.
- Even after running the three export commands above, those three environment variables are still empty:
declare -x MASTER_ADDR=""
declare -x MASTER_PORT=""
declare -x NODE_RANK=""
- I have changed my data_root to /data2/dsets/dataset.
Can you please help?
from vilt.
packages/pytorch_lightning/accelerators/accelerator_connector.py", line 415, in determine_ddp_node_rank
return int(rank)
ValueError: invalid literal for int() with base 10: ''
The error above seems to occur because DDP is not properly initialized.
Do you mean the export command did not change the environment variables?
Setting those environment variables is necessary for PyTorch Lightning to run DDP training properly.
Please make sure that those variables are set (you can check the current environment variables using the env command).
Also, check out this guide
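As a rough illustration of the check described above, the sketch below lists which of the three variables are unset or empty and shows why an empty NODE_RANK leads to the ValueError in the traceback (int("") cannot be parsed). The function names are hypothetical and not part of PyTorch Lightning or ViLT.

```python
import os

DDP_VARS = ("MASTER_ADDR", "MASTER_PORT", "NODE_RANK")


def find_bad_ddp_vars(env=os.environ):
    """Return the names of DDP variables that are unset or set to an empty string."""
    return [name for name in DDP_VARS if not env.get(name)]


def parse_node_rank(env=os.environ):
    """Parse NODE_RANK as an integer, with a clearer message than a bare int('')."""
    rank = env.get("NODE_RANK", "")
    if rank == "":
        # int("") raises "invalid literal for int() with base 10: ''",
        # which is the error shown in the traceback.
        raise ValueError("NODE_RANK is unset or empty; export it before launching")
    return int(rank)


if __name__ == "__main__":
    print("unset or empty:", find_bad_ddp_vars())
```

Running this in the same shell session that launches run.py shows whether the exports took effect there.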
File "/others/cs16b114/ViLT/vilt/datasets/base_dataset.py", line 53, in init
self.table_names += [name] * len(tables[i])
IndexError: list index out of range
Also, this error was probably raised because the list tables is empty.
Please check in advance that the dataset file exists at f"{data_dir}/{name}.arrow".
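A quick way to run that check before training is sketched below. It only assumes the f"{data_dir}/{name}.arrow" layout mentioned above. The function name and the example names passed to it are placeholders, so substitute the names your dataset class actually uses.

```python
from pathlib import Path


def missing_arrow_files(data_dir, names):
    """Return the expected '<data_dir>/<name>.arrow' paths that do not exist on disk."""
    expected = [Path(data_dir) / f"{name}.arrow" for name in names]
    return [str(path) for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Placeholder values: data_root from the command in this thread, made-up split names.
    for path in missing_arrow_files("/data2/dsets/dataset", ["example_train", "example_val"]):
        print("missing:", path)
```

If every expected path is reported missing, data_root most likely points at the wrong directory, which would be consistent with an empty tables list.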
from vilt.
Have you solved your problem? I've encountered the same problem. If possible, could you please tell me how you solved it?
from vilt.
Have you solved it? I have the same bug. I have checked the data location and the environment variables, but 'print(tables)' always comes out empty. Maybe there is something wrong with the way I set the path. Can you tell me where the data should be placed so that the command with "data_root=/data2/dsets/dataset" can be run directly?
from vilt.
I have found the reason: the address is missing the surrounding quotes ("").
from vilt.
Hi, even if I add the "" around the address, I still get the same error. Could you help me fix it? Thank you so much!
from vilt.
@jkkishore1999 I also noticed that you are using num_gpus=1 num_nodes=1. I use the same parameters as you, and I now get the same error as you.
May I know how this was resolved in the end? Is it to first set the environment variables and then check whether the file exists? However, I am using non-distributed training, so how should I choose the values of the environment variables (MASTER_ADDR="" MASTER_PORT="" NODE_RANK="")?
from vilt.
@hongzhenwang
Thanks for your answer. Are the values you mentioned meant for non-distributed runs? Does 0.0.0.0 mean that any IP is accepted? Can I just use the value "0.0.0.0" and NODE_RANK=0 for my non-distributed finetuning?
from vilt.