semseg's People

Contributors

gvi-lab, hszhao


semseg's Issues

Multi-gpu inference

I am interested in doing single-node multi-GPU inference. PyTorch DataParallel does not allow inputs of variable size. In the PSPNet code, separate chunks of data are fed using parfor. Do you have any suggestions for doing multi-GPU inference?
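
A minimal sketch of one way to do it: run one worker process per GPU with torch.multiprocessing and shard the image list, so each process handles variable-size inputs independently. build_model(), infer_one() and save_prediction() below are hypothetical stand-ins for the repo's model construction and single-image inference logic, not real APIs:

# Hypothetical sketch: shard a list of image paths across GPUs, one process per GPU.
import torch
import torch.multiprocessing as mp

def worker(gpu, image_paths, ngpus):
    torch.cuda.set_device(gpu)
    model = build_model().cuda(gpu).eval()          # load weights inside each process (hypothetical helper)
    with torch.no_grad():
        for path in image_paths[gpu::ngpus]:        # every ngpus-th image goes to this GPU
            pred = infer_one(model, path)           # variable-size inputs are fine per process (hypothetical helper)
            save_prediction(path, pred)             # hypothetical save helper

if __name__ == '__main__':
    ngpus = torch.cuda.device_count()
    paths = open('list.txt').read().split()
    mp.spawn(worker, nprocs=ngpus, args=(paths, ngpus))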

Documentation about running the demo on CPU

Although the README describes well the necessary steps to get a prediction demo, I think it would be nice to add some information about doing the same without a GPU. Maybe the best way would be to change tool/demo.py to support an additional CLI argument, but documenting the necessary changes would already be an improvement.

These changes were sufficient to make the demo work without GPU:

diff --git a/tool/demo.py b/tool/demo.py
index 6014081..1ae1bd3 100755
--- a/tool/demo.py
+++ b/tool/demo.py
@@ -92,11 +92,11 @@ def main():
                        normalization_factor=args.normalization_factor, psa_softmax=args.psa_softmax,
                        pretrained=False)
     logger.info(model)
-    model = torch.nn.DataParallel(model).cuda()
-    cudnn.benchmark = True
+    model = torch.nn.DataParallel(model)
+    cudnn.benchmark = False
     if os.path.isfile(args.model_path):
         logger.info("=> loading checkpoint '{}'".format(args.model_path))
-        checkpoint = torch.load(args.model_path)
+        checkpoint = torch.load(args.model_path, map_location=torch.device('cpu'))
         model.load_state_dict(checkpoint['state_dict'], strict=False)
         logger.info("=> loaded checkpoint '{}'".format(args.model_path))
     else:
@@ -112,7 +112,7 @@ def net_process(model, image, mean, std=None, flip=True):
     else:
         for t, m, s in zip(input, mean, std):
             t.sub_(m).div_(s)
-    input = input.unsqueeze(0).cuda()
+    input = input.unsqueeze(0)
     if flip:
         input = torch.cat([input, input.flip(3)], 0)
     with torch.no_grad():

In addition, recommending the installation of PyYAML and opencv-python (not official, but very handy) as part of the minimum requirements (tensorboardX and apex are not really necessary for a demo, right?) could also help those trying to do some quick tests.
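
For reference, a device-agnostic variant of the same changes (a sketch, not the repo's code; build_model() stands in for the model construction in tool/demo.py, and args comes from the script's argument parsing):

import torch

# Sketch of a device-agnostic setup; the repo's demo hard-codes .cuda() instead.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = torch.nn.DataParallel(build_model()).to(device)        # build_model() is a hypothetical stand-in
checkpoint = torch.load(args.model_path, map_location=device)
model.load_state_dict(checkpoint['state_dict'], strict=False)

# later, in net_process:
input = input.unsqueeze(0).to(device)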

OMP: Error #13: Assertion failure at z_Linux_util.cpp(2361).

When trying to use multiprocessing, the following error occurs:

OMP: Error #13: Assertion failure at z_Linux_util.cpp(2361).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see http://www.intel.com/software/products/support/.
(the Error/Hint pair above is repeated several times in the original output)
Traceback (most recent call last):
  File "tool/train.py", line 456, in <module>
    main()
  File "tool/train.py", line 106, in main
    mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args))
  File "/home/lzx/.conda/envs/pytorchseg10/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/lzx/.conda/envs/pytorchseg10/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 2 terminated with signal SIGABRT

This will happen when multiprocessing_distributed and use_apex are True.
Configuration details:

TRAIN:
  arch: psp
  layers: 101
  sync_bn: False  # adopt syncbn or not
  train_h: 585
  train_w: 585
  scale_min: 0.5  # minimum random scale
  scale_max: 2.0  # maximum random scale
  rotate_min: -10  # minimum random rotate
  rotate_max: 10  # maximum random rotate
  zoom_factor: 8  # zoom factor for final prediction during training, be in [1, 2, 4, 8]
  ignore_label: 255
  aux_weight: 0.4
  train_gpu: [0, 1, 2, 3, 4, 5, 6, 7]
  workers: 16  # data loader workers
  batch_size: 16  # batch size for training
  batch_size_val: 8  # batch size for validation during training, memory and speed tradeoff
  base_lr: 0.01
  epochs: 200
  start_epoch: 0
  new_epoch: 300  # for resume, how many new epochs to train
  power: 0.9
  momentum: 0.9
  weight_decay: 0.0001
  manual_seed: 520
  print_freq: 10
  save_freq: 1
  save_path: exp/cityscapes/pspnet101/model
  weight:  # path to initial weight (default: none)
  resume:  # path to latest checkpoint (default: none)
  evaluate: True # evaluate on validation set, extra gpu memory needed and small batch_size_val is recommend
Distributed:
  dist_url: tcp://127.0.0.1:6789
  dist_backend: 'nccl'
  multiprocessing_distributed: True
  world_size: 1
  rank: 0
  use_apex: True
  opt_level: 'O0'
  keep_batchnorm_fp32:
  loss_scale:

I have searched for solutions, but none of them worked.

Could you please give me some advice?

Thanks a lot!
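
Not a confirmed fix for this particular assertion, but a common mitigation when OpenMP clashes with spawned worker processes is to cap the CPU thread count before mp.spawn runs, e.g. near the top of tool/train.py:

import os
import torch

# Limit OpenMP/MKL threads before mp.spawn creates the worker processes.
# This is a general mitigation for OpenMP + spawn conflicts, not a verified
# fix for this specific assertion failure.
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")
torch.set_num_threads(1)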

Getting low mIoU on pretrained models

When performing evaluation on the Cityscapes validation set using the provided pspnet_resnet101 pretrained model, I get a much lower result than stated. Is there anything I'm missing?

[2019-09-20 22:33:20,678 INFO test.py line 249 24510] Eval result: mIoU/mAcc/allAcc 0.0039/0.0165/0.0077.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_0 result: iou/accuracy 0.0000/0.0001, name: road.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_1 result: iou/accuracy 0.0051/0.0112, name: sidewalk.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_2 result: iou/accuracy 0.0149/0.1796, name: building.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_3 result: iou/accuracy 0.0032/0.0045, name: wall.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_4 result: iou/accuracy 0.0350/0.0555, name: fence.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_5 result: iou/accuracy 0.0097/0.0398, name: pole.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_6 result: iou/accuracy 0.0000/0.0000, name: traffic light.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_7 result: iou/accuracy 0.0001/0.0001, name: traffic sign.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_8 result: iou/accuracy 0.0008/0.0035, name: vegetation.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_9 result: iou/accuracy 0.0030/0.0087, name: terrain.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_10 result: iou/accuracy 0.0000/0.0000, name: sky.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_11 result: iou/accuracy 0.0015/0.0016, name: person.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_12 result: iou/accuracy 0.0004/0.0005, name: rider.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_13 result: iou/accuracy 0.0005/0.0046, name: car.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_14 result: iou/accuracy 0.0000/0.0000, name: truck.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_15 result: iou/accuracy 0.0001/0.0013, name: bus.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_16 result: iou/accuracy 0.0000/0.0000, name: train.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_17 result: iou/accuracy 0.0005/0.0005, name: motorcycle.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_18 result: iou/accuracy 0.0003/0.0013, name: bicycle.

inference time

Hi,

What is the model speed in terms of inference time (i.e. frames per second), and what is the image size in that case?

Thanks,

Why do we have 8x upsampling in the end in PSPNet?

8x bilinear upsampling is a non-learnable operation, and doing it at the very end seems unintuitive.
I can see the results in the image below.

image
Left is ground truth and right are the predictions.
As you can see, small features such as the buildings and the roads are not being detected, and blobs appear instead.
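
For context, with zoom_factor 8 the network predicts logits at roughly 1/8 of the input resolution, and the final bilinear interpolation only brings them back to the input size, which is why thin structures can get blurred into blobs. A toy illustration:

import torch
import torch.nn.functional as F

# Toy illustration of the final 8x upsampling: logits come out at roughly H/8 x W/8
# and are interpolated back to H x W before the argmax / loss.
logits = torch.randn(1, 19, 90, 90)                     # e.g. 713x713 input -> 90x90 feature map
full = F.interpolate(logits, size=(713, 713), mode='bilinear', align_corners=True)
print(full.shape)                                       # torch.Size([1, 19, 713, 713])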

question about psanet

In psamask.py, I think the code for the output is the same whether self.psa_type == 1 or 2.

                if self.psa_type == 0:  # col
                    output[n, :, h, w] = mask_com.view(-1)
                else:  # dis
                    c = h * feature_W_ + w
                    output[n, c, :, :] = mask_com.view(feature_H_, feature_W_)
    if self.psa_type == 1:  # dis
        output = output.view(num_, feature_H_ * feature_W_, feature_H_ * feature_W_).transpose(1, 2).view(num_, feature_H_ * feature_W_, feature_H_, feature_W_)

because using .transpose and .view is equivalent to output[n, :, h, w] = mask_com.view(-1)

No speed up when using larger batch size

I ran this code on a machine with 2x Tesla P100 and PyTorch 1.5,
and the training throughput stays the same no matter what batch size I set.
For example, if I set the batch size to 4, one iteration takes one second; if I set it to 16, one iteration takes 4 seconds. Shouldn't it still be about 1 second?
Is there something wrong with the multiprocessing part?

Some questions about test.py

I can't understand the meaning of the following code:

    def scale_process(self, image, crop_h, crop_w, h, w, mean, std=None, stride_rate=2 / 3):
        ori_h, ori_w, _ = image.shape
        # pad the surrounding borders
        pad_h = max(crop_h - ori_h, 0)
        pad_w = max(crop_w - ori_w, 0)
        pad_h_half = int(pad_h / 2)
        pad_w_half = int(pad_w / 2)
        if pad_h > 0 or pad_w > 0:
            image = cv2.copyMakeBorder(image, pad_h_half, pad_h - pad_h_half, pad_w_half, pad_w - pad_w_half,
                                       cv2.BORDER_CONSTANT, value=mean)
        new_h, new_w, _ = image.shape
        # FAQ:
        stride_h = int(np.ceil(crop_h * stride_rate))
        stride_w = int(np.ceil(crop_w * stride_rate))
        grid_h = int(np.ceil(float(new_h - crop_h) / stride_h) + 1)
        grid_w = int(np.ceil(float(new_w - crop_w) / stride_w) + 1)
        prediction_crop = np.zeros((new_h, new_w, self.numclass), dtype=float)
        count_crop = np.zeros((new_h, new_w), float)
        for index_h in range(0, grid_h):
            for index_w in range(0, grid_w):
                s_h = index_h * stride_h
                e_h = min(s_h + crop_h, new_h)
                s_h = e_h - crop_h
                s_w = index_w * stride_w
                e_w = min(s_w + crop_w, new_w)
                s_w = e_w - crop_w
                image_crop = image[s_h:e_h, s_w:e_w].copy()
                count_crop[s_h:e_h, s_w:e_w] += 1
                prediction_crop[s_h:e_h, s_w:e_w, :] += self.net_process(image_crop, mean, std)
        prediction_crop /= np.expand_dims(count_crop, 2)
        ..........

I know that it needs to process the image with the training height and width, but I don't understand how. I'd appreciate any suggestions!
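
A worked example of what scale_process computes (the numbers are illustrative): with crop_h = crop_w = 713 and stride_rate = 2/3, the crop window slides with a 476-pixel stride, so every pixel is covered by at least one crop and the overlapping logits are averaged through count_crop:

import numpy as np

crop_h = crop_w = 713
stride_rate = 2 / 3
new_h, new_w = 1024, 2048          # illustrative size of the (scaled, padded) image

stride_h = int(np.ceil(crop_h * stride_rate))                  # 476
grid_h = int(np.ceil(float(new_h - crop_h) / stride_h) + 1)    # ceil(311/476) + 1 = 2
stride_w = int(np.ceil(crop_w * stride_rate))                  # 476
grid_w = int(np.ceil(float(new_w - crop_w) / stride_w) + 1)    # ceil(1335/476) + 1 = 4

# 2 x 4 = 8 overlapping 713x713 crops cover the 1024x2048 image; each crop's logits
# are accumulated into prediction_crop and divided by count_crop at the end.
print(grid_h, grid_w)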

PSANet is stuck in modeling

I have tried to train PSPNet50 and PSANet50.
PSPNet trains successfully, but PSANet gets stuck while building the model.
Has anyone encountered a similar problem?

How to set up data and labels

For example:
I have data and labels for a task that only identifies buildings in images, so the labels contain just background and building, with values 0 and 255. I revised the txt files in dataset/ and put everything in the right place, but the loss is either very large or very small and does not converge, and I don't know why.
Is it because I only have building data and labels?
Also, when I test, I get a color file and a gray file, and I have no idea what they mean.

Can you tell me how to do this, please?
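
For a 2-class setup the labels should be class indices in [0, classes-1], so a 0/255 building mask needs to be remapped to 0/1 (255 is reserved for ignore_label and contributes nothing to the loss). As far as I can tell, the gray test output stores per-pixel class indices and the color output is just those indices mapped through the colors file for visualization. A hedged remap sketch (file names are placeholders):

import cv2
import numpy as np

# Remap a 0/255 building mask to class indices {0: background, 1: building}.
label = cv2.imread('mask.png', cv2.IMREAD_GRAYSCALE)
label = np.where(label == 255, 1, 0).astype(np.uint8)
cv2.imwrite('mask_remapped.png', label)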

Pascal VOC transform GT to all 255

I found a bug related to transform.py, but I'm not sure where exactly it is. The loaded semantic segmentation label sometimes turns to all 255 after the transformation. The code for the transformation is:

train_transform = transform.Compose([
    transform.RandScale([args.scale_min, args.scale_max]),
    transform.RandRotate([args.rotate_min, args.rotate_max], padding=mean, ignore_label=args.ignore_label),
    transform.RandomGaussianBlur(),
    transform.RandomHorizontalFlip(),
    transform.Crop([args.train_h, args.train_w], crop_type='rand', padding=mean, ignore_label=args.ignore_label),
    transform.ToTensor(),
    transform.Normalize(mean=mean, std=std)])

and the configuration is:

train_h: 513
train_w: 513
scale_min: 0.5  # minimum random scale
scale_max: 2.0  # maximum random scale
rotate_min: -10  # minimum random rotate
rotate_max: 10  # maximum random rotate
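
One way to narrow this down (a debugging sketch, not a fix): with scale_min 0.5 and a 513x513 random crop, a small VOC image scaled down can end up mostly rotation/crop padding, which is filled with ignore_label; counting the label values right after the transform shows whether that is what is happening. image, label and train_transform below come from the training pipeline and are assumed to be in scope:

import numpy as np

# Debugging sketch: inspect how much of a transformed label is ignore_label (255).
# A crop that falls mostly on rotation/crop padding will be dominated by 255.
image_t, label_t = train_transform(image, label)
values, counts = np.unique(np.asarray(label_t), return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))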

Object detection and Instance segmentation support

Hi,

I just want to confirm: will the same code work for detection and instance segmentation if I train it on the BDD100K dataset?

It would be great if you could confirm. Also, could you highlight how we can add instance segmentation and detection if they are not already supported?

Thanks,

Training is slow

Hi, thank you for sharing the code! The work is very exciting and the code is elegantly written.
Here is a question: I trained on Cityscapes with ResNet-101 using the default settings except that batch_size is set to 8. In 14 hours I only got through 96 epochs; training is quick at the beginning and then gets slower and slower.
Hardware information: 2 V100 GPUs (32 GB GPU memory). The loss does not converge well.
image

About loading pretrained resnet into PSPNet

The lower part of PSPNet is slightly different from that of ResNet: PSPNet has three {conv, bn, relu} blocks while ResNet has only one. Did you re-train this modified ResNet (with the 3-layer stem) on the ImageNet dataset? I see in the code that you load a model called "ResNet50_v2"; is that the re-trained model?
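
For reference, the structural difference in question is the stem: a rough sketch (not copied verbatim from model/resnet.py) of the single 7x7 conv versus the three 3x3 convs used when the deep stem is enabled:

import torch.nn as nn

# Original ResNet stem: one 7x7 conv.
stem_v1 = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64), nn.ReLU(inplace=True))

# "Deep base" stem (approximate): three 3x3 convs replacing the single 7x7.
stem_v2 = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(128), nn.ReLU(inplace=True))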

Undefined name 'logger' in util/config.py

flake8 testing of https://github.com/hszhao/semseg on Python 3.7.1

$ flake8 . --count --select=E9,F63,F72,F82 --show-source --statistics

./util/config.py:164:9: F821 undefined name 'logger'
        logger.debug(msg)
        ^
1     F821 undefined name 'logger'
1

E901,E999,F821,F822,F823 are the "showstopper" flake8 issues that can halt the runtime with a SyntaxError, NameError, etc. These 5 are different from most other flake8 issues, which are merely "style violations" -- useful for readability but they do not affect runtime safety.

  • F821: undefined name name
  • F822: undefined name name in __all__
  • F823: local variable name referenced before assignment
  • E901: SyntaxError or IndentationError
  • E999: SyntaxError -- failed to compile a file into an Abstract Syntax Tree
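
For what it's worth, a minimal fix sketch (assuming a module-level logger is acceptable here; this is not an official patch) is to define logger at the top of util/config.py so the logger.debug(msg) call on line 164 resolves:

# At the top of util/config.py -- gives the module its own logger.
import logging

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.DEBUG)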

About how to choose the right epochs

Hi, how should we choose the right epoch to make sure our trained model is the best? semseg only saves the last two .pth checkpoints of the training epochs, so how do we know which epoch is the right one? Please reply when you see this.
Thanks

train error

When I run train.sh,

sometimes the error is:
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

sometimes the error is:
RuntimeError: cuDNN error: CUDNN_STATUS_ALLOC_FAILED

and sometimes:
RuntimeError: CUDA error: device-side assert triggered

I use
pytorch 1.1
cuda 9.0
cudnn 7.1.4

some problem with resize

When I used training images resized to 1024x2048 with a random crop size of 512x1024, I got 76.35 on the val set, but when I changed the resize to 512x1024 and the random crop to 384x768, I only got 72. Can you help me? (I have to use this setting for future work.) Thank you very much.

Question about PsaNet

In the code, what is the function of the psamask module, and how should I understand it?

Resume training performance drop

Hi, I'm trying to use your fantastic code but ran into something confusing when trying to load a checkpoint from a 200-epoch training run.

The training mIoU reached 0.739 at epoch 200. But when I load this checkpoint and continue training, the performance drops from 0.739 to 0.563 at epoch 201.

It's quite hard for me to understand this phenomenon; could you give me some advice?
Thank you so much!

Here is my config file:

DATA:
  data_root: /datasets-ssd/cityscapes
  train_list: dataset/cityscapes/list/fine_train.txt
  val_list: dataset/cityscapes/list/fine_val.txt
  classes: 19

TRAIN:
  arch: psp
  layers: 101
  sync_bn: True  # adopt syncbn or not
  train_h: 513
  train_w: 513
  scale_min: 0.5  # minimum random scale
  scale_max: 2.0  # maximum random scale
  rotate_min: -10  # minimum random rotate
  rotate_max: 10  # maximum random rotate
  zoom_factor: 8  # zoom factor for final prediction during training, be in [1, 2, 4, 8]
  ignore_label: 255
  aux_weight: 0.4
#  train_gpu: [0, 1, 2, 3, 4, 5, 6, 7]
  train_gpu: [0]
  workers: 6  # data loader workers
  batch_size: 2  # batch size for training
  batch_size_val: 1  # batch size for validation during training, memory and speed tradeoff
  base_lr: 0.01
  epochs: 400
  start_epoch: 200
  power: 0.9
  momentum: 0.9
  weight_decay: 0.0001
  manual_seed: 520
  print_freq: 10
  save_freq: 1
  save_path: exp/cityscapes/pspnet101/model
  weight:  # path to initial weight (default: none)
  resume:  exp/cityscapes/pspnet101/model/train_epoch_200.pth # path to latest checkpoint (default: none)
  evaluate: False  # evaluate on validation set, extra gpu memory needed and small batch_size_val is recommend
Distributed:
  distributed: True
  dist_url: tcp://127.0.0.1:6789
  dist_backend: 'nccl'
  multiprocessing_distributed: True
  world_size: 1
  rank: 0
  use_apex: True
  opt_level: 'O0'
  keep_batchnorm_fp32:
  loss_scale:

TEST:
  test_list: dataset/cityscapes/list/fine_val.txt
  split: val  # split in [train, val and test]
  base_size: 2048  # based size for scaling
  test_h: 713
  test_w: 713
  scales: [1.0]  # evaluation scales, ms as [0.5, 0.75, 1.0, 1.25, 1.5, 1.75]
  has_prediction: False  # has prediction already or not
  index_start: 0  # evaluation start index in list
  index_step: 0  # evaluation step index in list, 0 means to end
  test_gpu: [0]
  model_path: exp/cityscapes/pspnet101/model/train_epoch_400.pth  # evaluation model path
  save_folder: exp/cityscapes/pspnet101/result/epoch_400/val/ss  # results save folder
  colors_path: dataset/cityscapes/cityscapes_colors.txt  # path of dataset colors
  names_path: dataset/cityscapes/cityscapes_names.txt  # path of dataset category names

My environment:

  • 2080ti, 12G
  • CUDA 10.0
  • python 3.6

Hoping for your reply!
Thank you again!
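
For reference, a full resume normally restores the optimizer state and the epoch counter as well as the weights, so the poly learning-rate schedule continues where it stopped. A rough sketch of that pattern (approximately the shape of the resume branch in tool/train.py; treat the checkpoint keys as assumptions):

import os
import torch

# Sketch of a typical resume: restore weights, optimizer state and epoch counter.
if args.resume and os.path.isfile(args.resume):
    checkpoint = torch.load(args.resume, map_location='cpu')
    model.load_state_dict(checkpoint['state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    args.start_epoch = checkpoint['epoch']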

Image shape requirement

My customized dataset image.shape=(3, 512, 512).
In psanet.py line 157

assert (x_size[2] - 1) % 8 == 0 and (x_size[3] - 1) % 8 == 0
h = int((x_size[2] - 1) / 8 * self.zoom_factor + 1)
w = int((x_size[3] - 1) / 8 * self.zoom_factor + 1)

How should I modify the code to make this work?
PS:
If I just comment out line 157, it throws a RuntimeError in psanet.py line 98:

invalid argument 0: Sizes of tensors must match except in dimension 1. Got 63 and 64 in dimension 2
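
For reference, the assert requires (H - 1) % 8 == 0 and (W - 1) % 8 == 0, i.e. sizes such as 473, 513 or 713. One hedged option, rather than editing the model, is to pad 512x512 inputs up to 513x513 and crop the prediction back afterwards; a sketch:

import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 512, 512)                 # customized dataset input
# Pad one column/row so (513 - 1) % 8 == 0 holds; crop the prediction back afterwards.
x = F.pad(x, (0, 1, 0, 1), mode='reflect')      # pad (left, right, top, bottom)
print(x.shape)                                  # torch.Size([1, 3, 513, 513])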

RuntimeError: CUDA out of memory.

Totally 20210 samples in train set.
Starting Checking image&label pair train list...
Checking image&label pair train list done!
Traceback (most recent call last):
  File "tool/train.py", line 426, in <module>
    main()
  File "tool/train.py", line 107, in main
    main_worker(args.train_gpu, args.ngpus_per_node, args)
  File "tool/train.py", line 236, in main_worker
    loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, optimizer, epoch)
  File "tool/train.py", line 281, in train
    output, main_loss, aux_loss = model(input, target)
  File "/home/xrlin/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xrlin/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/xrlin/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xrlin/Desktop/semseg/model/pspnet.py", line 91, in forward
    x = self.layer4(x_tmp)
  File "/home/xrlin/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xrlin/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/xrlin/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xrlin/Desktop/semseg/model/resnet.py", line 87, in forward
    out = self.bn3(out)
  File "/home/xrlin/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xrlin/anaconda3/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 81, in forward
    exponential_average_factor, self.eps)
  File "/home/xrlin/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 1656, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 450.00 MiB (GPU 0; 23.65 GiB total capacity; 22.42 GiB already allocated; 105.12 MiB free; 311.16 MiB cached)

I tried to run every configuration in this project with one GPU (a Titan RTX with 24 GB of memory), but it shows CUDA out of memory, which is weird because no other programs are using the GPU. Do you have any suggestions for this issue?

RuntimeError: => no checkpoint found at 'exp/ade20k/pspnet50/model/train_epoch_100.pth'

I got this problem the first time I ran sh tool/train.sh ade20k pspnet50.

Traceback (most recent call last):
  File "tool/test.py", line 255, in <module>
    main()
  File "tool/test.py", line 117, in main
    raise RuntimeError("=> no checkpoint found at '{}'".format(args.model_path))
RuntimeError: => no checkpoint found at 'exp/ade20k/pspnet50/model/train_epoch_100.pth'

DATA:
  data_root: /home/zhangxl003/sheyulong/semseg-master/dataset/ade20k
  train_list: /home/zhangxl003/sheyulong/semseg-master/dataset/ade20k/list/training.txt
  val_list: /home/zhangxl003/sheyulong/semseg-master/dataset/ade20k/list/validation.txt
  classes: 2

TRAIN:
  arch: psp
  layers: 50
  sync_bn: True  # adopt sync_bn or not
  train_h: 473
  train_w: 473
  scale_min: 0.5  # minimum random scale
  scale_max: 2.0  # maximum random scale
  rotate_min: -10  # minimum random rotate
  rotate_max: 10  # maximum random rotate
  zoom_factor: 8  # zoom factor for final prediction during training, be in [1, 2, 4, 8]
  ignore_label: 255
  aux_weight: 0.4
  train_gpu: [5, 6]
  workers: 16  # data loader workers
  batch_size: 16  # batch size for training
  batch_size_val: 8  # batch size for validation during training, memory and speed tradeoff
  base_lr: 0.01
  epochs: 100
  start_epoch: 0
  power: 0.9
  momentum: 0.9
  weight_decay: 0.0001
  manual_seed:
  print_freq: 10
  save_freq: 1
  save_path: exp/ade20k/pspnet50/model
  weight:  # path to initial weight (default: none)
  resume:  # path to latest checkpoint (default: none)
  evaluate: False  # evaluate on validation set, extra gpu memory needed and small batch_size_val is recommend
Distributed:
  dist_url: tcp://127.0.0.1:6789
  dist_backend: 'nccl'
  multiprocessing_distributed: True
  world_size: 1
  rank: 0
  use_apex: True
  opt_level: 'O0'
  keep_batchnorm_fp32:
  loss_scale:

TEST:
  test_list: /home/zhangxl003/sheyulong/semseg-master/dataset/ade20k/list/validation.txt
  split: val  # split in [train, val and test]
  base_size: 512  # based size for scaling
  test_h: 473
  test_w: 473
  scales: [1.0]  # evaluation scales, ms as [0.5, 0.75, 1.0, 1.25, 1.5, 1.75]
  has_prediction: False  # has prediction already or not
  index_start: 0  # evaluation start index in list
  index_step: 0  # evaluation step index in list, 0 means to end
  test_gpu: [5]
  model_path: exp/ade20k/pspnet50/model/train_epoch_100.pth  # evaluation model path
  save_folder: exp/ade20k/pspnet50/result/epoch_100/val/ss  # results save folder
  colors_path: dataset/ade20k/ade20k_colors.txt  # path of dataset colors
  names_path: dataset/ade20k/ade20k_names.txt  # path of dataset category names

Getting lower mIoU on Cityscapes

Hi, I used the checkpoint downloaded from the provided Google Drive (pspnet/train-epoch-200) to test on the Cityscapes val dataset with 'sh tool/test.sh cityscapes pspnet50',
but I got a very low mIoU: Eval result: mIoU/mAcc/allAcc 0.0038/0.0150/0.0069.
After looking at the colored prediction images, the results actually look satisfactory. It seems the gray prediction images are saved with the wrong ids!
Can you tell me how to fix it?

A little mistake in intersectionAndUnion

Hi!
I used this code to train a model on CamVid, where the ignore_label is 11.
I found a little mistake in the testing code.
We should add a line in intersectionAndUnion in util/utils.py, and also pass the ignore index in tool/test.py:
intersection, union, target = intersectionAndUnion(pred, target, classes, ignore_index=args.ignore_label)

def intersectionAndUnion(output, target, K, ignore_index=255):
    # 'K' classes, output and target sizes are N or N * L or N * H * W, each value in range 0 to K - 1.
    assert (output.ndim in [1, 2, 3])
    assert output.shape == target.shape
    output = output.reshape(output.size).copy()
    target = target.reshape(target.size)
    output[np.where(target == ignore_index)[0]] = 255
    target[np.where(target == ignore_index)[0]] = 255
    intersection = output[np.where(output == target)[0]]
    area_intersection, _ = np.histogram(intersection, bins=np.arange(K+1))
    area_output, _ = np.histogram(output, bins=np.arange(K+1))
    area_target, _ = np.histogram(target, bins=np.arange(K+1))
    area_union = area_output + area_target - area_intersection
    return area_intersection, area_union, area_target

About train on my own datasets.

Hi! I want to train PSPNet on my own dataset. It consists of original images plus gray label images (I have set the useful classes to 0..C-1 and the ignored class to 255; the gray value of each pixel is the class index).

Now I want to know how to set up the $DATASET$_colors.txt. Can I set it up as follows?
1
2
3
...
Hope for your reply, thanks!
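
Based on the format of the provided dataset/cityscapes/cityscapes_colors.txt (one "R G B" triplet per line, one line per class), a 3-class colors file could look like the lines below; the colors only affect the color visualization, while the gray labels/outputs stay as class indices:

0 0 0
255 0 0
0 255 0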

definition of ppm

Hi, thank you for the open-source code. I wanted to ask you about the PPM; I did not see it mentioned under that name in the paper. What is its purpose, and is using it good or bad?
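
PPM here is the pyramid pooling module from the PSPNet paper: the feature map is average-pooled at several bin sizes, each pooled map is reduced with a 1x1 conv, upsampled back and concatenated with the original features. A condensed sketch of the idea; the full version lives in model/pspnet.py:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PPM(nn.Module):
    """Condensed pyramid pooling module: pool at several scales, reduce, upsample, concat."""
    def __init__(self, in_dim, reduction_dim, bins=(1, 2, 3, 6)):
        super().__init__()
        self.features = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(bin),
                nn.Conv2d(in_dim, reduction_dim, kernel_size=1, bias=False),
                nn.BatchNorm2d(reduction_dim),
                nn.ReLU(inplace=True))
            for bin in bins])

    def forward(self, x):
        size = x.shape[2:]
        out = [x]
        for f in self.features:
            out.append(F.interpolate(f(x), size, mode='bilinear', align_corners=True))
        return torch.cat(out, 1)

ppm = PPM(2048, 512)
y = ppm(torch.randn(2, 2048, 60, 60))
print(y.shape)   # torch.Size([2, 4096, 60, 60]) = 2048 + 4 * 512 channels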

How to train the model with the given scripts?

When I attempt to train the code with the given script, I get the following errors:

tool/train.sh: 15: tool/train.sh: -u: not found
tool/train.sh: 19: tool/train.sh: -u: not found

Could you give me a hand?
I am using conda.

Thanks

AttributeError: module 'apex' has no attribute 'parallel'

    BatchNorm = apex.parallel.SyncBatchNorm
AttributeError: module 'apex' has no attribute 'parallel'

Here is the config detail:

TRAIN:
  arch: pspnet
  layers: 101
  sync_bn: True  # adopt syncbn or not
  train_h: 713
  train_w: 713
  scale_min: 0.5  # minimum random scale
  scale_max: 2.0  # maximum random scale
  rotate_min: -10  # minimum random rotate
  rotate_max: 10  # maximum random rotate
  zoom_factor: 8  # zoom factor for final prediction during training, be in [1, 2, 4, 8]
  ignore_label: 255
  aux_weight: 0.4
#  train_gpu: [0, 1, 2, 3, 4, 5, 6, 7]
  train_gpu: [2,3]
  workers: 12  # data loader workers
  batch_size: 4  # total batch size for training
  batch_size_val: 1  # batch size for validation during training, memory and speed tradeoff
  base_lr: 0.01
  epochs: 300
  start_epoch: 0
  power: 0.9
  momentum: 0.9
  weight_decay: 0.0001
  manual_seed: 520
  print_freq: 10
  save_freq: 1
  save_path: exp/cityscapes/pspnet101/model
  weight:  # path to initial weight (default: none)
  resume:  # path to latest checkpoint (default: none)
  evaluate: True  # evaluate on validation set, extra gpu memory needed and small batch_size_val is recommend
Distributed:
  distributed: True
  dist_url: tcp://127.0.0.1:6789
  dist_backend: 'nccl'
  multiprocessing_distributed: True
  world_size: 1
  rank: 0
  use_apex: True
  opt_level: 'O0'
  keep_batchnorm_fp32:
  loss_scale:

Can you help me with this apex problem?

Assertion `t >= 0 && t < n_classes` failed

Thanks for sharing your code! I was not able to run the training code for the Cityscapes dataset. Below you can see my configuration file and the error messages. This seems to be related to the labels being out of range. I looked at your loader (SemData): it reads the label files (in my case, color label files from the Cityscapes dataset) but does not convert them to the range [0, n_classes-1]. Could you have a look at it? Thanks a lot!

DATA:
  data_root: /content/CityScapes_modified
  train_list: dataset/cityscapes/fine_train.txt
  val_list: dataset/cityscapes/fine_val.txt
  classes: 19

TRAIN:
  arch: psp
  layers: 50
  sync_bn: True  # adopt syncbn or not
  train_h: 713
  train_w: 713
  scale_min: 0.5  # minimum random scale
  scale_max: 2.0  # maximum random scale
  rotate_min: -10  # minimum random rotate
  rotate_max: 10  # maximum random rotate
  zoom_factor: 8  # zoom factor for final prediction during training, be in [1, 2, 4, 8]
  ignore_label: 255
  aux_weight: 0.4
  train_gpu: [0]
  workers: 4  # data loader workers
  batch_size: 2  # batch size for training
  batch_size_val: 2  # batch size for validation during training, memory and speed tradeoff
  base_lr: 0.01
  epochs: 200
  start_epoch: 0
  power: 0.9
  momentum: 0.9
  weight_decay: 0.0001
  manual_seed:
  print_freq: 10
  save_freq: 1
  save_path: exp/cityscapes/pspnet50/model
  weight:  # path to initial weight (default: none)
  resume:  # path to latest checkpoint (default: none)
  evaluate: False  # evaluate on validation set, extra gpu memory needed and small batch_size_val is recommend
Distributed:
  dist_url: tcp://127.0.0.1:6789
  dist_backend: 'nccl'
  multiprocessing_distributed: False
  world_size: 1
  rank: 0
  use_apex: True
  opt_level: 'O0'
  keep_batchnorm_fp32:
  loss_scale:

TEST:
  test_list: dataset/cityscapes/fine_val.txt
  split: val  # split in [train, val and test]
  base_size: 2048  # based size for scaling
  test_h: 713
  test_w: 713
  scales: [1.0]  # evaluation scales, ms as [0.5, 0.75, 1.0, 1.25, 1.5, 1.75]
  has_prediction: False  # has prediction already or not
  index_start: 0  # evaluation start index in list
  index_step: 0  # evaluation step index in list, 0 means to end
  test_gpu: [0]
  model_path: exp/dataset/cityscapes/pspnet50/model/train_epoch_200.pth  # evaluation model path
  save_folder: exp/dataset/cityscapes/pspnet50/result/epoch_200/val/ss  # results save folder
  colors_path: dataset/cityscapes/cityscapes_colors.txt  # path of dataset colors
  names_path: dataset/cityscapes/cityscapes_names.txt  # path of dataset category names

#######################################################################
Error messages:
.
.
.
Totally 2975 samples in train set.
Starting Checking image&label pair train list...
Checking image&label pair train list done!
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:104: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [3,0,0], thread: [160,0,0] Assertion t >= 0 && t < n_classes failed.
(the same assertion message is repeated for many more threads)
THCudaCheck FAIL file=/pytorch/aten/src/THCUNN/generic/SpatialClassNLLCriterion.cu line=127 error=710 : device-side assert triggered
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/content/semseg-master/tool/train.py", line 426, in <module>
    main()
  File "/content/semseg-master/tool/train.py", line 107, in main
    main_worker(args.train_gpu, args.ngpus_per_node, args)
  File "/content/semseg-master/tool/train.py", line 236, in main_worker
    loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, optimizer, epoch)
  File "/content/semseg-master/tool/train.py", line 281, in train
    output, main_loss, aux_loss = model(input, target)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/semseg-master/model/pspnet.py", line 102, in forward
    main_loss = self.criterion(x, y)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/loss.py", line 916, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 2009, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 1840, in nll_loss
    ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: cuda runtime error (710) : device-side assert triggered at /pytorch/aten/src/THCUNN/generic/SpatialClassNLLCriterion.cu:127
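
A quick way to confirm the out-of-range-label hypothesis before training (a debugging sketch with a placeholder path, not a fix) is to check that every label value is either below the class count or equal to ignore_label:

import cv2
import numpy as np

classes, ignore_label = 19, 255
label = cv2.imread('path/to/label.png', cv2.IMREAD_GRAYSCALE)   # placeholder path
values = np.unique(label)
bad = values[(values >= classes) & (values != ignore_label)]
print('label values:', values)
print('out-of-range values (these trigger the CUDA assert):', bad)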

Training performance drop on cityscapes with default parameters.

Training log.
[2019-06-14 18:10:31,725 INFO train.py line 154 10613] arch: psp
aux_weight: 0.4
base_lr: 0.01
base_size: 2048
batch_size: 16
batch_size_val: 1
classes: 19
colors_path: dataset/cityscapes/cityscapes_colors.txt
data_root: datasets/cityscapes/
dist_backend: nccl
dist_url: tcp://127.0.0.1:6789
distributed: True
epochs: 200
evaluate: False
has_prediction: False
ignore_label: 255
index_split: 5
index_start: 0
index_step: 0
keep_batchnorm_fp32: None
layers: 50
loss_scale: None
manual_seed: None
model_path: exp/cityscapes/pspnet50/model/train_epoch_200.pth
momentum: 0.9
multiprocessing_distributed: True
names_path: dataset/cityscapes/cityscapes_names.txt
ngpus_per_node: 8
opt_level: O0
power: 0.9
print_freq: 10
rank: 0
resume: None
rotate_max: 10
rotate_min: -10
save_folder: exp/cityscapes/pspnet50/result/epoch_200/val/ss
save_freq: 1
save_path: exp/cityscapes/pspnet50/model
scale_max: 2.0
scale_min: 0.5
scales: [1.0]
split: val
start_epoch: 0
sync_bn: True
test_gpu: [0]
test_h: 713
test_list: semseg-master/dataset/cityscapes/fine_val.txt
test_w: 713
train_gpu: [0, 1, 2, 3, 4, 5, 6, 7]
train_h: 713
train_list: semseg-master/dataset/cityscapes/fine_train.txt
train_w: 713
use_apex: True
val_list: semseg-master/dataset/cityscapes/fine_val.txt
weight: None
weight_decay: 0.0001
workers: 16
world_size: 8
zoom_factor: 8

Test log
[2019-06-16 08:08:38,259 INFO test.py line 249 23114] Eval result: mIoU/mAcc/allAcc 0.7695/0.8400/0.9603.
[2019-06-16 08:08:38,259 INFO test.py line 251 23114] Class_0 result: iou/accuracy 0.9804/0.9881, name: road.
[2019-06-16 08:08:38,259 INFO test.py line 251 23114] Class_1 result: iou/accuracy 0.8454/0.9255, name: sidewalk.
[2019-06-16 08:08:38,259 INFO test.py line 251 23114] Class_2 result: iou/accuracy 0.9235/0.9677, name: building.
[2019-06-16 08:08:38,259 INFO test.py line 251 23114] Class_3 result: iou/accuracy 0.5557/0.6280, name: wall.
[2019-06-16 08:08:38,259 INFO test.py line 251 23114] Class_4 result: iou/accuracy 0.6037/0.6966, name: fence.
[2019-06-16 08:08:38,260 INFO test.py line 251 23114] Class_5 result: iou/accuracy 0.6419/0.7436, name: pole.
[2019-06-16 08:08:38,260 INFO test.py line 251 23114] Class_6 result: iou/accuracy 0.7032/0.8118, name: traffic light.
[2019-06-16 08:08:38,260 INFO test.py line 251 23114] Class_7 result: iou/accuracy 0.7856/0.8581, name: traffic sign.
[2019-06-16 08:08:38,260 INFO test.py line 251 23114] Class_8 result: iou/accuracy 0.9260/0.9675, name: vegetation.
[2019-06-16 08:08:38,260 INFO test.py line 251 23114] Class_9 result: iou/accuracy 0.6553/0.7457, name: terrain.
[2019-06-16 08:08:38,260 INFO test.py line 251 23114] Class_10 result: iou/accuracy 0.9460/0.9770, name: sky.
[2019-06-16 08:08:38,260 INFO test.py line 251 23114] Class_11 result: iou/accuracy 0.8237/0.9218, name: person.
[2019-06-16 08:08:38,260 INFO test.py line 251 23114] Class_12 result: iou/accuracy 0.6327/0.7595, name: rider.
[2019-06-16 08:08:38,260 INFO test.py line 251 23114] Class_13 result: iou/accuracy 0.9490/0.9782, name: car.
[2019-06-16 08:08:38,260 INFO test.py line 251 23114] Class_14 result: iou/accuracy 0.7279/0.7737, name: truck.
[2019-06-16 08:08:38,260 INFO test.py line 251 23114] Class_15 result: iou/accuracy 0.8610/0.9364, name: bus.
[2019-06-16 08:08:38,260 INFO test.py line 251 23114] Class_16 result: iou/accuracy 0.6388/0.6618, name: train.
[2019-06-16 08:08:38,260 INFO test.py line 251 23114] Class_17 result: iou/accuracy 0.6448/0.7306, name: motorcycle.
[2019-06-16 08:08:38,260 INFO test.py line 251 23114] Class_18 result: iou/accuracy 0.7760/0.8881, name: bicycle.

By the way, the training time is much longer than you report (I used 8 P40 GPUs).

I want to train pspnet50, but did not succeed.

Thanks for your code. When I run 'sh tool/train.sh ade20k pspnet50' to train pspnet50, it only displays test results. I don't know what to change; can you help me?
/semseg/config/ade20k/ade20k_pspnet101.yaml
DATA:
  data_root: /DATA/maran/anaconda3/Project/semseg/data/ADEChallengeData2016/images/data/
  train_list: /DATA/maran/anaconda3/Project/semseg/data/ADEChallengeData2016/training.txt
  val_list: /DATA/maran/anaconda3/Project/semseg/data/ADEChallengeData2016/validation.txt
  classes: 150

TRAIN:
  arch: psp
  layers: 50
  sync_bn: True  # adopt sync_bn or not
  train_h: 473
  train_w: 473
  scale_min: 0.5  # minimum random scale
  scale_max: 2.0  # maximum random scale
  rotate_min: -10  # minimum random rotate
  rotate_max: 10  # maximum random rotate
  zoom_factor: 8  # zoom factor for final prediction during training, be in [1, 2, 4, 8]
  ignore_label: 255
  aux_weight: 0.4
  train_gpu: [0, 1, 2, 3, 4, 5, 6, 7]
  workers: 16  # data loader workers
  batch_size: 16  # batch size for training
  batch_size_val: 8  # batch size for validation during training, memory and speed tradeoff
  base_lr: 0.01
  epochs: 200
  start_epoch: 101
  power: 0.9
  momentum: 0.9
  weight_decay: 0.0001
  manual_seed:
  print_freq: 10
  save_freq: 1
  save_path: exp/ade20k/pspnet50/model
  weight:  # path to initial weight (default: none)
  resume: exp/ade20k/pspnet50/model/train_epoch_100.pth  # path to latest checkpoint (default: none)
  evaluate: False  # evaluate on validation set, extra gpu memory needed and small batch_size_val is recommend
Distributed:
  dist_url: tcp://127.0.0.1:6789
  dist_backend: 'nccl'
  multiprocessing_distributed: True
  world_size: 1
  rank: 0
  use_apex: True
  opt_level: 'O0'
  keep_batchnorm_fp32:
  loss_scale:

TEST:
  test_list: /DATA/maran/anaconda3/Project/semseg/data/ADEChallengeData2016/validation.txt
  split: val  # split in [train, val and test]
  base_size: 512  # based size for scaling
  test_h: 473
  test_w: 473
  scales: [1.0]  # evaluation scales, ms as [0.5, 0.75, 1.0, 1.25, 1.5, 1.75]
  has_prediction: False  # has prediction already or not
  index_start: 0  # evaluation start index in list
  index_step: 0  # evaluation step index in list, 0 means to end
  test_gpu: [0]
  model_path: exp/ade20k/pspnet50/model/train_epoch_100.pth  # evaluation model path
  save_folder: exp/ade20k/pspnet50/result/epoch_100/val/ss  # results save folder
  colors_path: dataset/ade20k/ade20k_colors.txt  # path of dataset colors
  names_path: dataset/ade20k/ade20k_names.txt  # path of dataset category names

training on bdd100k

Hi,

Can you please share data-loading code and configs to train on bdd100k?

Thanks,

Resnet training procedure

@hszhao, did you train the ResNet models on the ImageNet dataset with dilated convolutions, or do you reuse the kernels of the original ResNet model as dilated kernels in PSPNet?
I have used your pretrained ResNet weights to initialise PSPNet and it achieves the published accuracy on the validation set. I trained a ResNet model with the changed architecture (three 3x3 kernels instead of one 7x7 kernel) but without dilated convolutions and used it to train PSPNet, and there is a lot of deviation in accuracy. Could you please let me know the training protocol of the ResNet models?

Question about 'manualSeed'

What does manualSeed mean in the following code from train.py?

def main():
    args = get_parser()
    check(args)
    os.environ["CUDA_VISIBLE_DEVICES"] = ','.join(str(x) for x in args.train_gpu)
    if args.manual_seed is not None:
        random.seed(args.manual_seed)
        np.random.seed(args.manual_seed)
        torch.manual_seed(manualSeed)
        torch.cuda.manual_seed(manualSeed)
        torch.cuda.manual_seed_all(manualSeed)
        cudnn.benchmark = False
        cudnn.deterministic = True
    if args.dist_url == "env://" and args.world_size == -1:
        args.world_size = int(os.environ["WORLD_SIZE"])
    args.distributed = args.world_size > 1 or args.multiprocessing_distributed
    args.ngpus_per_node = len(args.train_gpu)
    if len(args.train_gpu) == 1:
        args.sync_bn = False
        args.distributed = False
        args.multiprocessing_distributed = False
    if args.multiprocessing_distributed:
        args.world_size = args.ngpus_per_node * args.world_size
        mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args))
    else:
        main_worker(args.train_gpu, args.ngpus_per_node, args)

When I try to use the PSANet code, I get this error: subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1. Need help, thanks.

Traceback (most recent call last):
  File "/home/wuwenfu/.conda/envs/pytorch/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 949, in _build_extension_module
    check=True)
  File "/home/wuwenfu/.conda/envs/pytorch/lib/python3.7/subprocess.py", line 487, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 36, in <module>
    from models.psanet.psanet import PSANet
  File "/mnt/sda7/seg-master/seg-master/sun/models/psanet/psanet.py", line 5, in <module>
    import models.psanet.lib.psa.functional as PF
  File "/mnt/sda7/seg-master/seg-master/sun/models/psanet/lib/psa/functional.py", line 3, in <module>
    from . import functions
  File "/mnt/sda7/seg-master/seg-master/sun/models/psanet/lib/psa/functions/__init__.py", line 1, in <module>
    from .psamask import *
  File "/mnt/sda7/seg-master/seg-master/sun/models/psanet/lib/psa/functions/psamask.py", line 3, in <module>
    from .. import src
  File "/mnt/sda7/seg-master/seg-master/sun/models/psanet/lib/psa/src/__init__.py", line 18, in <module>
    ], build_directory=gpu_path, verbose=False)
  File "/home/wuwenfu/.conda/envs/pytorch/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 644, in load
    is_python_module)
  File "/home/wuwenfu/.conda/envs/pytorch/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 813, in _jit_compile
    with_cuda=with_cuda)
  File "/home/wuwenfu/.conda/envs/pytorch/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 866, in _write_ninja_file_and_build
    _build_extension_module(name, build_directory, verbose)
  File "/home/wuwenfu/.conda/envs/pytorch/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 962, in _build_extension_module
    raise RuntimeError(message)
RuntimeError: Error building extension 'psamask_gpu':
[1/3] /usr/local/cuda-10.0/bin/bin/nvcc -DTORCH_EXTENSION_NAME=psamask_gpu -DTORCH_API_INCLUDE_EXTENSION_H ... -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --compiler-options '-fPIC' -std=c++11 -c /mnt/sda7/seg-master/seg-master/sun/models/psanet/lib/psa/src/gpu/psamask_cuda.cu -o psamask_cuda.cuda.o
FAILED: psamask_cuda.cuda.o
/bin/sh: 1: /usr/local/cuda-10.0/bin/bin/nvcc: not found
[2/3] c++ -MMD -MF operator.o.d -DTORCH_EXTENSION_NAME=psamask_gpu -DTORCH_API_INCLUDE_EXTENSION_H ... -fPIC -std=c++11 -c /mnt/sda7/seg-master/seg-master/sun/models/psanet/lib/psa/src/gpu/operator.cpp -o operator.o
In file included from /mnt/sda7/seg-master/seg-master/sun/models/psanet/lib/psa/src/gpu/operator.h:1:0,
                 from /mnt/sda7/seg-master/seg-master/sun/models/psanet/lib/psa/src/gpu/operator.cpp:1:
/home/wuwenfu/.conda/envs/pytorch/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/torch.h:7:2: warning: #warning "Including torch/torch.h for C++ extensions is deprecated. Please include torch/extension.h" [-Wcpp]
ninja: build stopped: subcommand failed.
(the long -isystem include flags in the nvcc/c++ commands are abbreviated with "..." here)

I need help. Thanks.

About the args.batch_size_val

Hi, hs.
In your code,

if args.distributed:
    torch.cuda.set_device(gpu)
    args.batch_size = int(args.batch_size / ngpus_per_node)
    args.batch_size_val = int(args.batch_size_val / ngpus_per_node)
    args.workers = int(args.workers / ngpus_per_node)

I think the default batch_size_val should be at least equal to ngpus_per_node, otherwise you get this error:
ValueError: batch_size should be a positive integeral value, but got batch_size=0
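
A small hedged workaround (not necessarily the intended fix) is to clamp the per-process values to at least 1:

# Keep at least one sample per process even when the global batch size is
# smaller than ngpus_per_node (sketch of a possible guard, not the official fix).
args.batch_size = max(1, int(args.batch_size / ngpus_per_node))
args.batch_size_val = max(1, int(args.batch_size_val / ngpus_per_node))
args.workers = max(1, int(args.workers / ngpus_per_node))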

An assertion failure...

When I tried to train the network with 2 GPUs, I got this error:
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/SpatialClassNLLCriterion.cu:103: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [0,0,0], thread: [363,0,0] Assertion t >= 0 && t < n_classes failed.
Is there any solution?
