
gfm's Issues

gradient overflow

Hi, congratulations on your great study.
I ran into a problem where the gradient sometimes overflowed (see the log snippet below) when I tried to pre-train the GFM model using GeoPile.

[2024-01-24 14:52:40 simmim_pretrain](main_teacher.py 224): INFO Train: [10/100][700/1148]  eta 0:07:43  lr 0.000049  time 0.2130 (1.0337)  loss -0.6558 (-0.6616)  grad_norm 0.6889 (0.7676)  mem 11705MB
[2024-01-24 14:54:19 simmim_pretrain](main_teacher.py 224): INFO Train: [10/100][800/1148]  eta 0:05:57  lr 0.000049  time 0.2107 (1.0273)  loss -0.6804 (-0.6620)  grad_norm 0.5012 (0.7538)  mem 11705MB
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 262144.0
[2024-01-24 14:56:02 simmim_pretrain](main_teacher.py 224): INFO Train: [10/100][900/1148]  eta 0:04:14  lr 0.000049  time 0.2089 (1.0278)  loss -0.6462 (-0.6619)  grad_norm 0.4569 (inf)  mem 11705MB
[2024-01-24 14:57:42 simmim_pretrain](main_teacher.py 224): INFO Train: [10/100][1000/1148]  eta 0:02:31  lr 0.000049  time 0.2100 (1.0249)  loss -0.7272 (-0.6625)  grad_norm 0.5404 (inf)  mem 11705MB
[2024-01-24 14:59:23 simmim_pretrain](main_teacher.py 224): INFO Train: [10/100][1100/1148]  eta 0:00:49  lr 0.000049  time 0.2098 (1.0242)  loss -0.6921 (-0.6626)  grad_norm 0.5038 (inf)  mem 11705MB
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 262144.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 262144.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 262144.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 262144.0

I wonder whether this situation also happened during your pre-training stage.

FYI:
I trained on 2 nodes, each with 8 GPUs, with the batch size set to 32 due to GPU memory limits.
BASE_LR and MIN_LR were also scaled down accordingly, by 128/32 = 4 fold.
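For context, the "Gradient overflow. Skipping step, loss scaler 0 reducing loss scale ..." message is the dynamic loss scaler detecting inf/NaN gradients and shrinking the loss scale instead of applying that optimizer step. Below is a minimal sketch of that mechanism using native torch.cuda.amp; the log above comes from an apex-style scaler, so this is only an illustration, not the repo's actual training loop.

import torch

model = torch.nn.Linear(10, 10).cuda()                      # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scaler = torch.cuda.amp.GradScaler()                        # dynamic loss scaler

def train_step(x, y):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                               # unscale before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)   # silently skipped when any gradient is inf/NaN
    scaler.update()          # on overflow, the scale is reduced ("reducing loss scale to ...")
    return loss.item()

Clipping the unscaled gradient norm is one common way to keep occasional spikes from repeatedly triggering the overflow path, although a few skipped steps are usually harmless.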

Best wishes,
Hoter Young

code release

Great work!
Will you release the code and the model?

One more question: have you ever trained on GeoPile with the framework introduced in the DMAE paper?

test code

Hi, congratulations on your excellent work.
I wonder whether and when the fine-tuning and test YAMLs/scripts for all downstream tasks will be released, since currently only the fine-tuning YAMLs for UCM and BEN are available.

Best wishes,
Hoter Young

A question about the segmentation task

Hi Author,
Thank you for open-sourcing your work on GitHub. I read the README and the paper looking for the code to run image segmentation, but could not find it. Could you tell me how to run inference for the segmentation task?

Change detection and segmentation tasks code

Hello, do you plan to release the code for the experiments on change detection and segmentation tasks described in the paper? I think it would be beneficial for the community.
Thanks

SimMIM pre-training script: error: the following arguments are required: --local_rank

Dear Community,

I tried to run the provided basic command for pre-training:

python -m torch.distributed.launch --nproc_per_node 8 main_teacher.py \
    --cfg configs/simmim_pretrain__swin_base__img192_window6__100ep.yaml --batch-size 128 \
    --data-path ~/GeoPile/GeoPileV0/ --tag gfm --pretrained ~/output/simmim_finetune/swin_base_patch4_window7_224_22k.pth
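(For reference, on newer PyTorch the equivalent launch would use torchrun in place of torch.distributed.launch; this is only a sketch, with the config and paths unchanged. Note that torchrun passes the local rank through the LOCAL_RANK environment variable rather than as a --local_rank argument, which turns out to be relevant to the error below.)

torchrun --nproc_per_node 8 main_teacher.py \
    --cfg configs/simmim_pretrain__swin_base__img192_window6__100ep.yaml --batch-size 128 \
    --data-path ~/GeoPile/GeoPileV0/ --tag gfm --pretrained ~/output/simmim_finetune/swin_base_patch4_window7_224_22k.pth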

However, I'm getting the error below:

warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


usage: SimMIM pre-training script --cfg FILE [--opts OPTS [OPTS ...]] [--batch-size BATCH_SIZE] [--data-path DATA_PATH] [--pretrained PRETRAINED] [--resume RESUME]
[--accumulation-steps ACCUMULATION_STEPS] [--use-checkpoint] [--amp-opt-level {O0,O1,O2}] [--output PATH] [--tag TAG] [--alpha ALPHA]
--local_rank LOCAL_RANK
SimMIM pre-training script: error: the following arguments are required: --local_rank
(the same usage/error message is printed once for each of the 8 worker processes)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 44681) of binary: /home/abd037/anaconda3/envs/SimMIM/bin/python
Traceback (most recent call last):
File "/home/abd037/anaconda3/envs/SimMIM/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/abd037/anaconda3/envs/SimMIM/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/abd037/anaconda3/envs/SimMIM/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in
main()
File "/home/abd037/anaconda3/envs/SimMIM/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/home/abd037/anaconda3/envs/SimMIM/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/home/abd037/anaconda3/envs/SimMIM/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/abd037/anaconda3/envs/SimMIM/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/abd037/anaconda3/envs/SimMIM/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main_teacher.py FAILED

Failures:
(ranks 1-7 on the same host failed identically at 2023-11-22_15:21:44 with exitcode 2; per-rank details omitted)

Root Cause (first observed failure):
[0]:
time : 2023-11-22_15:21:44
host : ??????
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 44681)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

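One possibility I am considering: newer versions of torch.distributed.launch no longer pass --local_rank on the command line and instead provide it through the LOCAL_RANK environment variable, so an argparse argument declared as required fails exactly like this. A minimal workaround sketch, assuming main_teacher.py builds its parser with argparse:

import argparse
import os

parser = argparse.ArgumentParser('SimMIM pre-training script')
# ... the script's other arguments go here ...
# Make --local_rank optional and fall back to the LOCAL_RANK environment
# variable set by torchrun / newer torch.distributed.launch.
parser.add_argument('--local_rank', type=int,
                    default=int(os.environ.get('LOCAL_RANK', 0)),
                    help='local rank for distributed training')
args = parser.parse_args()

With a change like that, the same command (or the torchrun form shown above) should get past argument parsing.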
I do appreciate your help.

Kind Regards,
Mah

unfreeze ImageNet teacher

Hi! This is indeed interesting work. I wonder if you have done any experiment where you unfreeze the ImageNet teacher in GFM and train everything?
