
gfm's Issues

gradient overflow

Hi, congratulations on your great study.
I ran into a problem where the gradient sometimes overflowed (see the log snippet below) when I tried to pre-train the GFM model using GeoPile.

[2024-01-24 14:52:40 simmim_pretrain](main_teacher.py 224): INFO Train: [10/100][700/1148]  eta 0:07:43  lr 0.000049  time 0.2130 (1.0337)  loss -0.6558 (-0.6616)  grad_norm 0.6889 (0.7676)  mem 11705MB
[2024-01-24 14:54:19 simmim_pretrain](main_teacher.py 224): INFO Train: [10/100][800/1148]  eta 0:05:57  lr 0.000049  time 0.2107 (1.0273)  loss -0.6804 (-0.6620)  grad_norm 0.5012 (0.7538)  mem 11705MB
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 262144.0
[2024-01-24 14:56:02 simmim_pretrain](main_teacher.py 224): INFO Train: [10/100][900/1148]  eta 0:04:14  lr 0.000049  time 0.2089 (1.0278)  loss -0.6462 (-0.6619)  grad_norm 0.4569 (inf)  mem 11705MB
[2024-01-24 14:57:42 simmim_pretrain](main_teacher.py 224): INFO Train: [10/100][1000/1148]  eta 0:02:31  lr 0.000049  time 0.2100 (1.0249)  loss -0.7272 (-0.6625)  grad_norm 0.5404 (inf)  mem 11705MB
[2024-01-24 14:59:23 simmim_pretrain](main_teacher.py 224): INFO Train: [10/100][1100/1148]  eta 0:00:49  lr 0.000049  time 0.2098 (1.0242)  loss -0.6921 (-0.6626)  grad_norm 0.5038 (inf)  mem 11705MB
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 262144.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 262144.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 262144.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 262144.0

I wonder whether this situation also happened during your pre-training stage.

FYI:
I trained on 2 nodes, each with 8 GPUs, with the batch size set to 32 due to GPU memory limits.
BASE_LR and MIN_LR were also scaled down accordingly, by 128/32 = 4 fold.
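For context, the "Gradient overflow. Skipping step, loss scaler 0 reducing loss scale ..." message is the dynamic loss scaler detecting inf/NaN gradients and shrinking the loss scale instead of applying that optimizer step. Below is a minimal sketch of that mechanism using native torch.cuda.amp; the log above comes from an apex-style scaler, so this is only an illustration, not the repo's actual training loop.

import torch

model = torch.nn.Linear(10, 10).cuda()                      # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scaler = torch.cuda.amp.GradScaler()                        # dynamic loss scaler

def train_step(x, y):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                               # unscale before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)   # silently skipped when any gradient is inf/NaN
    scaler.update()          # on overflow, the scale is reduced ("reducing loss scale to ...")
    return loss.item()

Clipping the unscaled gradient norm is one common way to keep occasional spikes from repeatedly triggering the overflow path, although a few skipped steps are usually harmless.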

Best wishes,
Hoter Young

code release

Great work!
Will you release the code and the model?

One more question: have you ever trained on GeoPile with the framework introduced in the DMAE paper?

test code

Hi, congratulations on your excellent work.
I wonder whether and when the fine-tuning and test YAMLs/scripts for all downstream tasks will be released, since currently only the fine-tuning YAMLs for UCM and BEN are available.

Best wishes,
Hoter Young

A question about the segmentation task

Hi Author,
Thank you for open-sourcing your work on GitHub. I read the README and the paper looking for the code to run image segmentation, but could not find it. Could you tell me how to run inference for the segmentation task?

Change detection and segmentation tasks code

Hello, do you plan to release the code for the experiments on change detection and segmentation tasks described in the paper? I think it would be beneficial for the community.
Thanks

SimMIM pre-training script: error: the following arguments are required: --local_rank

Dear Community,

I tried to run the provided basic command for pre-training:

python -m torch.distributed.launch --nproc_per_node 8 main_teacher.py \
    --cfg configs/simmim_pretrain__swin_base__img192_window6__100ep.yaml --batch-size 128 \
    --data-path ~/GeoPile/GeoPileV0/ --tag gfm --pretrained ~/output/simmim_finetune/swin_base_patch4_window7_224_22k.pth
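(For reference, on newer PyTorch the equivalent launch would use torchrun in place of torch.distributed.launch; this is only a sketch, with the config and paths unchanged. Note that torchrun passes the local rank through the LOCAL_RANK environment variable rather than as a --local_rank argument, which turns out to be relevant to the error below.)

torchrun --nproc_per_node 8 main_teacher.py \
    --cfg configs/simmim_pretrain__swin_base__img192_window6__100ep.yaml --batch-size 128 \
    --data-path ~/GeoPile/GeoPileV0/ --tag gfm --pretrained ~/output/simmim_finetune/swin_base_patch4_window7_224_22k.pth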

However, I'm getting the error below:

warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


usage: SimMIM pre-training script --cfg FILE [--opts OPTS [OPTS ...]] [--batch-size BATCH_SIZE] [--data-path DATA_PATH] [--pretrained PRETRAINED] [--resume RESUME]
[--accumulation-steps ACCUMULATION_STEPS] [--use-checkpoint] [--amp-opt-level {O0,O1,O2}] [--output PATH] [--tag TAG] [--alpha ALPHA]
--local_rank LOCAL_RANK
SimMIM pre-training script: error: the following arguments are required: --local_rank
(the same usage/error message is printed once for each of the 8 worker processes)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 44681) of binary: /home/abd037/anaconda3/envs/SimMIM/bin/python
Traceback (most recent call last):
File "/home/abd037/anaconda3/envs/SimMIM/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/abd037/anaconda3/envs/SimMIM/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/abd037/anaconda3/envs/SimMIM/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in
main()
File "/home/abd037/anaconda3/envs/SimMIM/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/home/abd037/anaconda3/envs/SimMIM/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/home/abd037/anaconda3/envs/SimMIM/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/abd037/anaconda3/envs/SimMIM/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/abd037/anaconda3/envs/SimMIM/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main_teacher.py FAILED

Failures:
(ranks 1-7 on the same host failed identically at 2023-11-22_15:21:44 with exitcode 2; per-rank details omitted)

Root Cause (first observed failure):
[0]:
time : 2023-11-22_15:21:44
host : ??????
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 44681)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

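One possibility I am considering: newer versions of torch.distributed.launch no longer pass --local_rank on the command line and instead provide it through the LOCAL_RANK environment variable, so an argparse argument declared as required fails exactly like this. A minimal workaround sketch, assuming main_teacher.py builds its parser with argparse:

import argparse
import os

parser = argparse.ArgumentParser('SimMIM pre-training script')
# ... the script's other arguments go here ...
# Make --local_rank optional and fall back to the LOCAL_RANK environment
# variable set by torchrun / newer torch.distributed.launch.
parser.add_argument('--local_rank', type=int,
                    default=int(os.environ.get('LOCAL_RANK', 0)),
                    help='local rank for distributed training')
args = parser.parse_args()

With a change like that, the same command (or the torchrun form shown above) should get past argument parsing.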
I do appreciate your help.

Kind Regards,
Mah

unfreeze ImageNet teacher

Hi! This is indeed interesting work. I wonder if you have done any experiment where you unfreeze the ImageNet teacher in GFM and train everything?
