Pretrain phase problem about albef (CLOSED)

salesforce commented on May 16, 2024
Pretrain phase problem

from albef.

Comments (11)

LiJunnan1992 commented on May 16, 2024

Hi, I haven't met this problem, and the error message does not seem to point to any part of the pretraining code.

haoshuai714 commented on May 16, 2024

Thanks! Maybe the problem is a mismatch between the Python and PyTorch versions. Could you provide a requirements file listing the Python version, torch version, etc.? Thank you!

LiJunnan1992 commented on May 16, 2024

This code has been tested on Python 3.8 and pytorch 1.09.

Junjie-Ye commented on May 16, 2024

I hit the following error in the pretraining phase:
Traceback (most recent call last):
  File "Pretrain.py", line 215, in <module>
    main(args, config)
  File "Pretrain.py", line 93, in main
    utils.init_distributed_mode(args)
  File "/root/albef/ALBEF/utils.py", line 257, in init_distributed_mode
    torch.distributed.barrier()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 2420, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

Have you ever had a similar problem? Thanks!
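
For anyone hitting the same `ncclSystemError`, one way to get more detail is to turn on NCCL's own logging before the process group is created. The sketch below is only a debugging aid under assumptions: `NCCL_DEBUG` and `NCCL_SOCKET_IFNAME` are real NCCL environment variables, but the helper names and the interface name `eth0` are hypothetical, and nothing in this thread confirms what the root cause actually is.

```python
import os


def nccl_debug_env(interface=None):
    """Build environment variables that make NCCL log more detail.

    `interface` is an optional network interface name (hypothetical,
    e.g. "eth0"); pass it only if you suspect NCCL picks the wrong one.
    """
    env = {"NCCL_DEBUG": "INFO"}  # NCCL prints init and error details
    if interface is not None:
        env["NCCL_SOCKET_IFNAME"] = interface
    return env


def apply_nccl_debug_env(interface=None):
    """Export the variables into this process and return what was set."""
    env = nccl_debug_env(interface)
    os.environ.update(env)
    return env
```

These variables need to be set before `utils.init_distributed_mode(args)` runs, for example by calling `apply_nccl_debug_env()` at the top of `main` or by exporting them in the shell that launches `Pretrain.py`.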

haoshuai714 commented on May 16, 2024

Check your Python and PyTorch versions!

Junjie-Ye commented on May 16, 2024

My python version is 3.7 and my torch version is 1.8.0. Is there anything wrong?
Thanks for your answer.

haoshuai714 commented on May 16, 2024

The author's version: this code has been tested on Python 3.8 and pytorch 1.09.
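
To compare an environment against the tested one, the versions can be printed in one go. This is a minimal sketch, not part of the ALBEF code: the function name `collect_versions` is hypothetical, and it assumes only that `torch` may or may not be installed. `torch.version.cuda` and `torch.cuda.nccl.version()` are existing PyTorch calls; the NCCL value is stored as returned, since its type differs between PyTorch releases.

```python
import importlib
import platform


def collect_versions():
    """Return the Python, torch, CUDA and NCCL versions that are visible."""
    info = {"python": platform.python_version(), "torch": None}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return info  # torch is not installed in this environment

    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None for CPU-only builds
    try:
        # Stored as returned (an int or a tuple depending on the release).
        info["nccl"] = torch.cuda.nccl.version()
    except Exception:
        info["nccl"] = None  # NCCL not available in this build
    return info


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```

Pasting this output into the issue makes it easier to spot a mismatch with the versions the code was tested on.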

haoshuai714 commented on May 16, 2024

Could you provide an "environment.yml" file that contains the "Install dependencies" information?
For example: https://github.com/jonmun/EPIC-KITCHENS-100_UDA_TA3N/blob/main/environment.yml
Thanks!
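
If a full "environment.yml" is not at hand, a pinned list of the relevant packages can be produced from an existing working environment. The sketch below is an illustration under assumptions: the helper name `pinned_requirements` and the package names in the usage example are hypothetical and are not taken from the ALBEF repository. It relies on `importlib.metadata`, which is in the standard library from Python 3.8.

```python
from importlib import metadata


def pinned_requirements(package_names):
    """Return "name==version" lines for the packages that are installed.

    Packages that are not installed are skipped rather than guessed.
    """
    lines = []
    for name in package_names:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            continue  # not installed here, so there is no version to pin
    return lines


if __name__ == "__main__":
    # Hypothetical package list; adjust it to what the project imports.
    for line in pinned_requirements(["torch", "torchvision"]):
        print(line)
```

The printed lines can be saved as a requirements file and shared in the issue, so that others can recreate the same versions.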

Junjie-Ye commented on May 16, 2024

Thanks. Would you please tell me your e-mail address? I'll send the document to you.

haoshuai714 commented on May 16, 2024

[email protected]

Junjie-Ye commented on May 16, 2024

I have already sent the document to your e-mail. Looking forward to your reply. Thank you!
