semseg's People

Contributors

gvi-lab, hszhao


semseg's Issues

Multi-gpu inference

I am interested in doing single-node multi-GPU inference. PyTorch DataParallel does not allow inputs of variable size. In the PSPNet code, separate chunks of data are fed using parfor. Do you have any suggestions for doing multi-GPU inference?
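
A minimal sketch of one way to do it: run one worker process per GPU with torch.multiprocessing and shard the image list, so each process handles variable-size inputs independently. build_model(), infer_one() and save_prediction() below are hypothetical stand-ins for the repo's model construction and single-image inference logic, not real APIs:

# Hypothetical sketch: shard a list of image paths across GPUs, one process per GPU.
import torch
import torch.multiprocessing as mp

def worker(gpu, image_paths, ngpus):
    torch.cuda.set_device(gpu)
    model = build_model().cuda(gpu).eval()          # load weights inside each process (hypothetical helper)
    with torch.no_grad():
        for path in image_paths[gpu::ngpus]:        # every ngpus-th image goes to this GPU
            pred = infer_one(model, path)           # variable-size inputs are fine per process (hypothetical helper)
            save_prediction(path, pred)             # hypothetical save helper

if __name__ == '__main__':
    ngpus = torch.cuda.device_count()
    paths = open('list.txt').read().split()
    mp.spawn(worker, nprocs=ngpus, args=(paths, ngpus))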

Documentation about running the demo on CPU

Although the README describes well the necessary steps to get a prediction demo, I think it would be nice to add some information about doing the same without a GPU. Maybe the best way would be to change tool/demo.py to support an additional CLI argument, but documenting the necessary changes would already be an improvement.

These changes were sufficient to make the demo work without GPU:

diff --git a/tool/demo.py b/tool/demo.py
index 6014081..1ae1bd3 100755
--- a/tool/demo.py
+++ b/tool/demo.py
@@ -92,11 +92,11 @@ def main():
                        normalization_factor=args.normalization_factor, psa_softmax=args.psa_softmax,
                        pretrained=False)
     logger.info(model)
-    model = torch.nn.DataParallel(model).cuda()
-    cudnn.benchmark = True
+    model = torch.nn.DataParallel(model)
+    cudnn.benchmark = False
     if os.path.isfile(args.model_path):
         logger.info("=> loading checkpoint '{}'".format(args.model_path))
-        checkpoint = torch.load(args.model_path)
+        checkpoint = torch.load(args.model_path, map_location=torch.device('cpu'))
         model.load_state_dict(checkpoint['state_dict'], strict=False)
         logger.info("=> loaded checkpoint '{}'".format(args.model_path))
     else:
@@ -112,7 +112,7 @@ def net_process(model, image, mean, std=None, flip=True):
     else:
         for t, m, s in zip(input, mean, std):
             t.sub_(m).div_(s)
-    input = input.unsqueeze(0).cuda()
+    input = input.unsqueeze(0)
     if flip:
         input = torch.cat([input, input.flip(3)], 0)
     with torch.no_grad():

In addition, recommending the installation of PyYAML and opencv-python (not official, but very handy) as part of the minimum requirements (tensorboardX and apex are not really necessary for a demo, right?) could also help those trying to do some quick tests.
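
For reference, a device-agnostic variant of the same changes (a sketch, not the repo's code; build_model() stands in for the model construction in tool/demo.py, and args comes from the script's argument parsing):

import torch

# Sketch of a device-agnostic setup; the repo's demo hard-codes .cuda() instead.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = torch.nn.DataParallel(build_model()).to(device)        # build_model() is a hypothetical stand-in
checkpoint = torch.load(args.model_path, map_location=device)
model.load_state_dict(checkpoint['state_dict'], strict=False)

# later, in net_process:
input = input.unsqueeze(0).to(device)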

OMP: Error #13: Assertion failure at z_Linux_util.cpp(2361).

When trying to use multiprocessing, the following error occurs:

OMP: Error #13: Assertion failure at z_Linux_util.cpp(2361).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see http://www.intel.com/software/products/support/.
(the Error/Hint pair above is repeated several times in the original output)
Traceback (most recent call last):
  File "tool/train.py", line 456, in <module>
    main()
  File "tool/train.py", line 106, in main
    mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args))
  File "/home/lzx/.conda/envs/pytorchseg10/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/lzx/.conda/envs/pytorchseg10/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 2 terminated with signal SIGABRT

This will happen when multiprocessing_distributed and use_apex are True.
Configuration details:

TRAIN:
  arch: psp
  layers: 101
  sync_bn: False  # adopt syncbn or not
  train_h: 585
  train_w: 585
  scale_min: 0.5  # minimum random scale
  scale_max: 2.0  # maximum random scale
  rotate_min: -10  # minimum random rotate
  rotate_max: 10  # maximum random rotate
  zoom_factor: 8  # zoom factor for final prediction during training, be in [1, 2, 4, 8]
  ignore_label: 255
  aux_weight: 0.4
  train_gpu: [0, 1, 2, 3, 4, 5, 6, 7]
  workers: 16  # data loader workers
  batch_size: 16  # batch size for training
  batch_size_val: 8  # batch size for validation during training, memory and speed tradeoff
  base_lr: 0.01
  epochs: 200
  start_epoch: 0
  new_epoch: 300  # for resume, how many new epochs to train
  power: 0.9
  momentum: 0.9
  weight_decay: 0.0001
  manual_seed: 520
  print_freq: 10
  save_freq: 1
  save_path: exp/cityscapes/pspnet101/model
  weight:  # path to initial weight (default: none)
  resume:  # path to latest checkpoint (default: none)
  evaluate: True # evaluate on validation set, extra gpu memory needed and small batch_size_val is recommend
Distributed:
  dist_url: tcp://127.0.0.1:6789
  dist_backend: 'nccl'
  multiprocessing_distributed: True
  world_size: 1
  rank: 0
  use_apex: True
  opt_level: 'O0'
  keep_batchnorm_fp32:
  loss_scale:

I have searched for solutions, but none of them worked.

Could you please give me some advice?

Thanks a lot!
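
Not a confirmed fix for this particular assertion, but a common mitigation when OpenMP clashes with spawned worker processes is to cap the CPU thread count before mp.spawn runs, e.g. near the top of tool/train.py:

import os
import torch

# Limit OpenMP/MKL threads before mp.spawn creates the worker processes.
# This is a general mitigation for OpenMP + spawn conflicts, not a verified
# fix for this specific assertion failure.
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")
torch.set_num_threads(1)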

Getting low mIoU on pretrained models

When performing evaluation on the Cityscapes validation set using the provided pspnet_resnet101 pretrained model, I get a much lower result than stated. Is there anything I'm missing?

[2019-09-20 22:33:20,678 INFO test.py line 249 24510] Eval result: mIoU/mAcc/allAcc 0.0039/0.0165/0.0077.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_0 result: iou/accuracy 0.0000/0.0001, name: road.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_1 result: iou/accuracy 0.0051/0.0112, name: sidewalk.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_2 result: iou/accuracy 0.0149/0.1796, name: building.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_3 result: iou/accuracy 0.0032/0.0045, name: wall.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_4 result: iou/accuracy 0.0350/0.0555, name: fence.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_5 result: iou/accuracy 0.0097/0.0398, name: pole.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_6 result: iou/accuracy 0.0000/0.0000, name: traffic light.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_7 result: iou/accuracy 0.0001/0.0001, name: traffic sign.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_8 result: iou/accuracy 0.0008/0.0035, name: vegetation.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_9 result: iou/accuracy 0.0030/0.0087, name: terrain.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_10 result: iou/accuracy 0.0000/0.0000, name: sky.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_11 result: iou/accuracy 0.0015/0.0016, name: person.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_12 result: iou/accuracy 0.0004/0.0005, name: rider.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_13 result: iou/accuracy 0.0005/0.0046, name: car.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_14 result: iou/accuracy 0.0000/0.0000, name: truck.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_15 result: iou/accuracy 0.0001/0.0013, name: bus.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_16 result: iou/accuracy 0.0000/0.0000, name: train.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_17 result: iou/accuracy 0.0005/0.0005, name: motorcycle.
[2019-09-20 22:33:20,678 INFO test.py line 251 24510] Class_18 result: iou/accuracy 0.0003/0.0013, name: bicycle.

inference time

Hi,

What is the model speed in terms of inference time (i.e. frames per second), and what is the image size in that case?

Thanks,

Why do we have 8x upsampling in the end in PSPNet?

8x bilinear upsampling is a non-learnable operation, and doing it at the very end seems unintuitive.
I can see the results in the image below.

image
Left is ground truth and right are the predictions.
As you can see, small features such as the buildings and the roads are not being detected, and blobs appear instead.
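
For context, with zoom_factor 8 the network predicts logits at roughly 1/8 of the input resolution, and the final bilinear interpolation only brings them back to the input size, which is why thin structures can get blurred into blobs. A toy illustration:

import torch
import torch.nn.functional as F

# Toy illustration of the final 8x upsampling: logits come out at roughly H/8 x W/8
# and are interpolated back to H x W before the argmax / loss.
logits = torch.randn(1, 19, 90, 90)                     # e.g. 713x713 input -> 90x90 feature map
full = F.interpolate(logits, size=(713, 713), mode='bilinear', align_corners=True)
print(full.shape)                                       # torch.Size([1, 19, 713, 713])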

question about psanet

In psamask.py, I think the code for the output is the same whether self.psa_type == 1 or 2.

                if self.psa_type == 0:  # col
                    output[n, :, h, w] = mask_com.view(-1)
                else:  # dis
                    c = h * feature_W_ + w
                    output[n, c, :, :] = mask_com.view(feature_H_, feature_W_)
    if self.psa_type == 1:  # dis
        output = output.view(num_, feature_H_ * feature_W_, feature_H_ * feature_W_).transpose(1, 2).view(num_, feature_H_ * feature_W_, feature_H_, feature_W_)

because using .transpose and .view is equivalent to output[n, :, h, w] = mask_com.view(-1)

No speed up when using larger batch size

I ran this code on a machine with 2x Tesla P100 and PyTorch 1.5,
and the training throughput stays the same no matter what batch size I set.
For example, if I set the batch size to 4, one iteration takes one second; if I set it to 16, one iteration takes 4 seconds. Shouldn't it still be about 1 second?
Is there something wrong with the multiprocessing part?

Some questions about test.py

I can't understand the meaning of the following code:

    def scale_process(self, image, crop_h, crop_w, h, w, mean, std=None, stride_rate=2 / 3):
        ori_h, ori_w, _ = image.shape
        # pad the surrounding borders
        pad_h = max(crop_h - ori_h, 0)
        pad_w = max(crop_w - ori_w, 0)
        pad_h_half = int(pad_h / 2)
        pad_w_half = int(pad_w / 2)
        if pad_h > 0 or pad_w > 0:
            image = cv2.copyMakeBorder(image, pad_h_half, pad_h - pad_h_half, pad_w_half, pad_w - pad_w_half,
                                       cv2.BORDER_CONSTANT, value=mean)
        new_h, new_w, _ = image.shape
        # FAQ:
        stride_h = int(np.ceil(crop_h * stride_rate))
        stride_w = int(np.ceil(crop_w * stride_rate))
        grid_h = int(np.ceil(float(new_h - crop_h) / stride_h) + 1)
        grid_w = int(np.ceil(float(new_w - crop_w) / stride_w) + 1)
        prediction_crop = np.zeros((new_h, new_w, self.numclass), dtype=float)
        count_crop = np.zeros((new_h, new_w), float)
        for index_h in range(0, grid_h):
            for index_w in range(0, grid_w):
                s_h = index_h * stride_h
                e_h = min(s_h + crop_h, new_h)
                s_h = e_h - crop_h
                s_w = index_w * stride_w
                e_w = min(s_w + crop_w, new_w)
                s_w = e_w - crop_w
                image_crop = image[s_h:e_h, s_w:e_w].copy()
                count_crop[s_h:e_h, s_w:e_w] += 1
                prediction_crop[s_h:e_h, s_w:e_w, :] += self.net_process(image_crop, mean, std)
        prediction_crop /= np.expand_dims(count_crop, 2)
        ..........

I know that it needs to process the image with the training height and width, but I don't understand how. I'd appreciate any suggestions!
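
A worked example of what scale_process computes (the numbers are illustrative): with crop_h = crop_w = 713 and stride_rate = 2/3, the crop window slides with a 476-pixel stride, so every pixel is covered by at least one crop and the overlapping logits are averaged through count_crop:

import numpy as np

crop_h = crop_w = 713
stride_rate = 2 / 3
new_h, new_w = 1024, 2048          # illustrative size of the (scaled, padded) image

stride_h = int(np.ceil(crop_h * stride_rate))                  # 476
grid_h = int(np.ceil(float(new_h - crop_h) / stride_h) + 1)    # ceil(311/476) + 1 = 2
stride_w = int(np.ceil(crop_w * stride_rate))                  # 476
grid_w = int(np.ceil(float(new_w - crop_w) / stride_w) + 1)    # ceil(1335/476) + 1 = 4

# 2 x 4 = 8 overlapping 713x713 crops cover the 1024x2048 image; each crop's logits
# are accumulated into prediction_crop and divided by count_crop at the end.
print(grid_h, grid_w)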

PSANet is stuck in modeling

I have tried to train PSPNet50 and PSANet50.
PSPNet trains successfully, but PSANet gets stuck while building the model.
Has anyone encountered a similar problem?

How to set up data and labels

For example:
I have data and labels for a task that only identifies buildings in images, so the labels contain just background and building, with values 0 and 255. I revised the txt files in dataset/ and put everything in the right place, but the loss is either very large or very small and does not converge, and I don't know why.
Is it because I only have building data and labels?
Also, when I test, I get a color file and a gray file, and I have no idea what they mean.

Can you tell me how to do this, please?
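
For a 2-class setup the labels should be class indices in [0, classes-1], so a 0/255 building mask needs to be remapped to 0/1 (255 is reserved for ignore_label and contributes nothing to the loss). As far as I can tell, the gray test output stores per-pixel class indices and the color output is just those indices mapped through the colors file for visualization. A hedged remap sketch (file names are placeholders):

import cv2
import numpy as np

# Remap a 0/255 building mask to class indices {0: background, 1: building}.
label = cv2.imread('mask.png', cv2.IMREAD_GRAYSCALE)
label = np.where(label == 255, 1, 0).astype(np.uint8)
cv2.imwrite('mask_remapped.png', label)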

Pascal VOC transform GT to all 255

I found a bug related to transform.py, but I'm not sure where exactly it is. The loaded semantic segmentation label sometimes turns to all 255 after the transformation. The code for the transformation is:

train_transform = transform.Compose([
    transform.RandScale([args.scale_min, args.scale_max]),
    transform.RandRotate([args.rotate_min, args.rotate_max], padding=mean, ignore_label=args.ignore_label),
    transform.RandomGaussianBlur(),
    transform.RandomHorizontalFlip(),
    transform.Crop([args.train_h, args.train_w], crop_type='rand', padding=mean, ignore_label=args.ignore_label),
    transform.ToTensor(),
    transform.Normalize(mean=mean, std=std)])

and the configuration is:

train_h: 513
train_w: 513
scale_min: 0.5  # minimum random scale
scale_max: 2.0  # maximum random scale
rotate_min: -10  # minimum random rotate
rotate_max: 10  # maximum random rotate
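
One way to narrow this down (a debugging sketch, not a fix): with scale_min 0.5 and a 513x513 random crop, a small VOC image scaled down can end up mostly rotation/crop padding, which is filled with ignore_label; counting the label values right after the transform shows whether that is what is happening. image, label and train_transform below come from the training pipeline and are assumed to be in scope:

import numpy as np

# Debugging sketch: inspect how much of a transformed label is ignore_label (255).
# A crop that falls mostly on rotation/crop padding will be dominated by 255.
image_t, label_t = train_transform(image, label)
values, counts = np.unique(np.asarray(label_t), return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))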

Object detection and Instance segmentation support

Hi,

I just want to confirm: will the same code work for detection and instance segmentation if I train it on the BDD100K dataset?

It would be great if you could confirm. Also, could you highlight how we can add instance segmentation and detection if they are not already supported?

Thanks,

Training is slow

Hi, thank you for sharing the code! The work is very exciting and the code is elegantly written.
Here is a question: I trained on Cityscapes with ResNet-101 using the default settings except that batch_size is set to 8. In 14 hours I only got through 96 epochs; training is quick at the beginning and then gets slower and slower.
Hardware information: 2 V100 GPUs (32 GB GPU memory). The loss does not converge well.
image

About loading pretrained resnet into PSPNet

The lower part of PSPNet is slightly different from that of ResNet: PSPNet has three {conv, bn, relu} blocks while ResNet has only one. Did you re-train this modified ResNet (with the 3-layer stem) on the ImageNet dataset? I see in the code that you load a model called "ResNet50_v2"; is that the re-trained model?
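
For reference, the structural difference in question is the stem: a rough sketch (not copied verbatim from model/resnet.py) of the single 7x7 conv versus the three 3x3 convs used when the deep stem is enabled:

import torch.nn as nn

# Original ResNet stem: one 7x7 conv.
stem_v1 = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64), nn.ReLU(inplace=True))

# "Deep base" stem (approximate): three 3x3 convs replacing the single 7x7.
stem_v2 = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(128), nn.ReLU(inplace=True))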

Undefined name 'logger' in util/config.py

flake8 testing of https://github.com/hszhao/semseg on Python 3.7.1

$ flake8 . --count --select=E9,F63,F72,F82 --show-source --statistics

./util/config.py:164:9: F821 undefined name 'logger'
        logger.debug(msg)
        ^
1     F821 undefined name 'logger'
1

E901,E999,F821,F822,F823 are the "showstopper" flake8 issues that can halt the runtime with a SyntaxError, NameError, etc. These 5 are different from most other flake8 issues, which are merely "style violations" -- useful for readability but they do not affect runtime safety.

  • F821: undefined name name
  • F822: undefined name name in __all__
  • F823: local variable name referenced before assignment
  • E901: SyntaxError or IndentationError
  • E999: SyntaxError -- failed to compile a file into an Abstract Syntax Tree
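
For what it's worth, a minimal fix sketch (assuming a module-level logger is acceptable here; this is not an official patch) is to define logger at the top of util/config.py so the logger.debug(msg) call on line 164 resolves:

# At the top of util/config.py -- gives the module its own logger.
import logging

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.DEBUG)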

About how to choose the right epochs

Hi, how should we choose the right epoch to make sure our trained model is the best? semseg only saves the last two .pth checkpoints of the training epochs, so how do we know which epoch is the right one? Please reply when you see this.
Thanks

train error

When I run train.sh,

sometimes the error is:
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

sometimes the error is:
RuntimeError: cuDNN error: CUDNN_STATUS_ALLOC_FAILED

and sometimes:
RuntimeError: CUDA error: device-side assert triggered

I use
pytorch 1.1
cuda 9.0
cudnn 7.1.4

some problem with resize

When I used training images resized to 1024x2048 with a random crop size of 512x1024, I got 76.35 on the val set, but when I changed the resize to 512x1024 and the random crop to 384x768, I only got 72. Can you help me? (I have to use this setting for future work.) Thank you very much.

Question about PsaNet

In the code, what is the function of the psamask module, and how should I understand it?

Resume training performance drop

Hi, I'm trying to use your fantastic code but ran into something confusing when trying to load a checkpoint from a 200-epoch training run.

The training mIoU reached 0.739 at epoch 200. But when I load this checkpoint and continue training, the performance drops from 0.739 to 0.563 at epoch 201.

It's quite hard for me to understand this phenomenon; could you give me some advice?
Thank you so much!

Here is my config file:

DATA:
  data_root: /datasets-ssd/cityscapes
  train_list: dataset/cityscapes/list/fine_train.txt
  val_list: dataset/cityscapes/list/fine_val.txt
  classes: 19

TRAIN:
  arch: psp
  layers: 101
  sync_bn: True  # adopt syncbn or not
  train_h: 513
  train_w: 513
  scale_min: 0.5  # minimum random scale
  scale_max: 2.0  # maximum random scale
  rotate_min: -10  # minimum random rotate
  rotate_max: 10  # maximum random rotate
  zoom_factor: 8  # zoom factor for final prediction during training, be in [1, 2, 4, 8]
  ignore_label: 255
  aux_weight: 0.4
#  train_gpu: [0, 1, 2, 3, 4, 5, 6, 7]
  train_gpu: [0]
  workers: 6  # data loader workers
  batch_size: 2  # batch size for training
  batch_size_val: 1  # batch size for validation during training, memory and speed tradeoff
  base_lr: 0.01
  epochs: 400
  start_epoch: 200
  power: 0.9
  momentum: 0.9
  weight_decay: 0.0001
  manual_seed: 520
  print_freq: 10
  save_freq: 1
  save_path: exp/cityscapes/pspnet101/model
  weight:  # path to initial weight (default: none)
  resume:  exp/cityscapes/pspnet101/model/train_epoch_200.pth # path to latest checkpoint (default: none)
  evaluate: False  # evaluate on validation set, extra gpu memory needed and small batch_size_val is recommend
Distributed:
  distributed: True
  dist_url: tcp://127.0.0.1:6789
  dist_backend: 'nccl'
  multiprocessing_distributed: True
  world_size: 1
  rank: 0
  use_apex: True
  opt_level: 'O0'
  keep_batchnorm_fp32:
  loss_scale:

TEST:
  test_list: dataset/cityscapes/list/fine_val.txt
  split: val  # split in [train, val and test]
  base_size: 2048  # based size for scaling
  test_h: 713
  test_w: 713
  scales: [1.0]  # evaluation scales, ms as [0.5, 0.75, 1.0, 1.25, 1.5, 1.75]
  has_prediction: False  # has prediction already or not
  index_start: 0  # evaluation start index in list
  index_step: 0  # evaluation step index in list, 0 means to end
  test_gpu: [0]
  model_path: exp/cityscapes/pspnet101/model/train_epoch_400.pth  # evaluation model path
  save_folder: exp/cityscapes/pspnet101/result/epoch_400/val/ss  # results save folder
  colors_path: dataset/cityscapes/cityscapes_colors.txt  # path of dataset colors
  names_path: dataset/cityscapes/cityscapes_names.txt  # path of dataset category names

My environment:

  • 2080ti, 12G
  • CUDA 10.0
  • python 3.6

Hoping for your reply!
Thank you again!
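
For reference, a full resume normally restores the optimizer state and the epoch counter as well as the weights, so the poly learning-rate schedule continues where it stopped. A rough sketch of that pattern (approximately the shape of the resume branch in tool/train.py; treat the checkpoint keys as assumptions):

import os
import torch

# Sketch of a typical resume: restore weights, optimizer state and epoch counter.
if args.resume and os.path.isfile(args.resume):
    checkpoint = torch.load(args.resume, map_location='cpu')
    model.load_state_dict(checkpoint['state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    args.start_epoch = checkpoint['epoch']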

Image shape requirement

My customized dataset image.shape=(3, 512, 512).
In psanet.py line 157

assert (x_size[2] - 1) % 8 == 0 and (x_size[3] - 1) % 8 == 0
h = int((x_size[2] - 1) / 8 * self.zoom_factor + 1)
w = int((x_size[3] - 1) / 8 * self.zoom_factor + 1)

How should I modify the code to make this work?
PS:
If I just comment out line 157, it throws a RuntimeError in psanet.py line 98:

invalid argument 0: Sizes of tensors must match except in dimension 1. Got 63 and 64 in dimension 2
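
For reference, the assert requires (H - 1) % 8 == 0 and (W - 1) % 8 == 0, i.e. sizes such as 473, 513 or 713. One hedged option, rather than editing the model, is to pad 512x512 inputs up to 513x513 and crop the prediction back afterwards; a sketch:

import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 512, 512)                 # customized dataset input
# Pad one column/row so (513 - 1) % 8 == 0 holds; crop the prediction back afterwards.
x = F.pad(x, (0, 1, 0, 1), mode='reflect')      # pad (left, right, top, bottom)
print(x.shape)                                  # torch.Size([1, 3, 513, 513])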

RuntimeError: CUDA out of memory.

Totally 20210 samples in train set.
Starting Checking image&label pair train list...
Checking image&label pair train list done!
Traceback (most recent call last):
  File "tool/train.py", line 426, in <module>
    main()
  File "tool/train.py", line 107, in main
    main_worker(args.train_gpu, args.ngpus_per_node, args)
  File "tool/train.py", line 236, in main_worker
    loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, optimizer, epoch)
  File "tool/train.py", line 281, in train
    output, main_loss, aux_loss = model(input, target)
  File "/home/xrlin/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xrlin/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/xrlin/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xrlin/Desktop/semseg/model/pspnet.py", line 91, in forward
    x = self.layer4(x_tmp)
  File "/home/xrlin/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xrlin/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/xrlin/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xrlin/Desktop/semseg/model/resnet.py", line 87, in forward
    out = self.bn3(out)
  File "/home/xrlin/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xrlin/anaconda3/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 81, in forward
    exponential_average_factor, self.eps)
  File "/home/xrlin/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 1656, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 450.00 MiB (GPU 0; 23.65 GiB total capacity; 22.42 GiB already allocated; 105.12 MiB free; 311.16 MiB cached)

I tried to run every configuration in this project with one GPU (a Titan RTX with 24 GB of memory), but it shows CUDA out of memory, which is weird because no other programs are using the GPU. Do you have any suggestions for this issue?

RuntimeError: => no checkpoint found at 'exp/ade20k/pspnet50/model/train_epoch_100.pth'

I got this problem the first time I ran sh tool/train.sh ade20k pspnet50.

Traceback (most recent call last):
  File "tool/test.py", line 255, in <module>
    main()
  File "tool/test.py", line 117, in main
    raise RuntimeError("=> no checkpoint found at '{}'".format(args.model_path))
RuntimeError: => no checkpoint found at 'exp/ade20k/pspnet50/model/train_epoch_100.pth'

DATA:
  data_root: /home/zhangxl003/sheyulong/semseg-master/dataset/ade20k
  train_list: /home/zhangxl003/sheyulong/semseg-master/dataset/ade20k/list/training.txt
  val_list: /home/zhangxl003/sheyulong/semseg-master/dataset/ade20k/list/validation.txt
  classes: 2

TRAIN:
  arch: psp
  layers: 50
  sync_bn: True  # adopt sync_bn or not
  train_h: 473
  train_w: 473
  scale_min: 0.5  # minimum random scale
  scale_max: 2.0  # maximum random scale
  rotate_min: -10  # minimum random rotate
  rotate_max: 10  # maximum random rotate
  zoom_factor: 8  # zoom factor for final prediction during training, be in [1, 2, 4, 8]
  ignore_label: 255
  aux_weight: 0.4
  train_gpu: [5, 6]
  workers: 16  # data loader workers
  batch_size: 16  # batch size for training
  batch_size_val: 8  # batch size for validation during training, memory and speed tradeoff
  base_lr: 0.01
  epochs: 100
  start_epoch: 0
  power: 0.9
  momentum: 0.9
  weight_decay: 0.0001
  manual_seed:
  print_freq: 10
  save_freq: 1
  save_path: exp/ade20k/pspnet50/model
  weight:  # path to initial weight (default: none)
  resume:  # path to latest checkpoint (default: none)
  evaluate: False  # evaluate on validation set, extra gpu memory needed and small batch_size_val is recommend
Distributed:
  dist_url: tcp://127.0.0.1:6789
  dist_backend: 'nccl'
  multiprocessing_distributed: True
  world_size: 1
  rank: 0
  use_apex: True
  opt_level: 'O0'
  keep_batchnorm_fp32:
  loss_scale:

TEST:
  test_list: /home/zhangxl003/sheyulong/semseg-master/dataset/ade20k/list/validation.txt
  split: val  # split in [train, val and test]
  base_size: 512  # based size for scaling
  test_h: 473
  test_w: 473
  scales: [1.0]  # evaluation scales, ms as [0.5, 0.75, 1.0, 1.25, 1.5, 1.75]
  has_prediction: False  # has prediction already or not
  index_start: 0  # evaluation start index in list
  index_step: 0  # evaluation step index in list, 0 means to end
  test_gpu: [5]
  model_path: exp/ade20k/pspnet50/model/train_epoch_100.pth  # evaluation model path
  save_folder: exp/ade20k/pspnet50/result/epoch_100/val/ss  # results save folder
  colors_path: dataset/ade20k/ade20k_colors.txt  # path of dataset colors
  names_path: dataset/ade20k/ade20k_names.txt  # path of dataset category names

Getting lower mIoU on Cityscapes

Hi, I used the checkpoint downloaded from the provided Google Drive (pspnet/train-epoch-200) to test on the Cityscapes val dataset with 'sh tool/test.sh cityscapes pspnet50',
but I got a very low mIoU: Eval result: mIoU/mAcc/allAcc 0.0038/0.0150/0.0069.
After looking at the colored prediction images, the results actually look satisfactory. It seems the gray prediction images are saved with the wrong ids!
Can you tell me how to fix it?

A little mistake in intersectionAndUnion

Hi!
I used this code to train a model on CamVid, where the ignore_label is 11.
I found a little mistake in the testing code.
We should add a line in intersectionAndUnion in util/utils.py, and also pass the ignore index in tool/test.py:
intersection, union, target = intersectionAndUnion(pred, target, classes, ignore_index=args.ignore_label)

def intersectionAndUnion(output, target, K, ignore_index=255):
    # 'K' classes, output and target sizes are N or N * L or N * H * W, each value in range 0 to K - 1.
    assert (output.ndim in [1, 2, 3])
    assert output.shape == target.shape
    output = output.reshape(output.size).copy()
    target = target.reshape(target.size)
    output[np.where(target == ignore_index)[0]] = 255
    target[np.where(target == ignore_index)[0]] = 255
    intersection = output[np.where(output == target)[0]]
    area_intersection, _ = np.histogram(intersection, bins=np.arange(K+1))
    area_output, _ = np.histogram(output, bins=np.arange(K+1))
    area_target, _ = np.histogram(target, bins=np.arange(K+1))
    area_union = area_output + area_target - area_intersection
    return area_intersection, area_union, area_target

About train on my own datasets.

Hi! I want to train PSPNet on my own dataset. It consists of original images plus gray label images (I have set the useful classes to 0..C-1 and the ignored class to 255; the gray value of each pixel is the class index).

Now I want to know how to set up the $DATASET$_colors.txt. Can I set it up as follows?
1
2
3
...
Hope for your reply, thanks!
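
Based on the format of the provided dataset/cityscapes/cityscapes_colors.txt (one "R G B" triplet per line, one line per class), a 3-class colors file could look like the lines below; the colors only affect the color visualization, while the gray labels/outputs stay as class indices:

0 0 0
255 0 0
0 255 0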

definition of ppm

Hi, thank you for the open-source code. I wanted to ask you about the PPM; I did not see it mentioned under that name in the paper. What is its purpose, and is using it good or bad?
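
PPM here is the pyramid pooling module from the PSPNet paper: the feature map is average-pooled at several bin sizes, each pooled map is reduced with a 1x1 conv, upsampled back and concatenated with the original features. A condensed sketch of the idea; the full version lives in model/pspnet.py:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PPM(nn.Module):
    """Condensed pyramid pooling module: pool at several scales, reduce, upsample, concat."""
    def __init__(self, in_dim, reduction_dim, bins=(1, 2, 3, 6)):
        super().__init__()
        self.features = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(bin),
                nn.Conv2d(in_dim, reduction_dim, kernel_size=1, bias=False),
                nn.BatchNorm2d(reduction_dim),
                nn.ReLU(inplace=True))
            for bin in bins])

    def forward(self, x):
        size = x.shape[2:]
        out = [x]
        for f in self.features:
            out.append(F.interpolate(f(x), size, mode='bilinear', align_corners=True))
        return torch.cat(out, 1)

ppm = PPM(2048, 512)
y = ppm(torch.randn(2, 2048, 60, 60))
print(y.shape)   # torch.Size([2, 4096, 60, 60]) = 2048 + 4 * 512 channels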

How to train the model with the given scripts?

When I attempt to train the code with the given script, I get the following errors:

tool/train.sh: 15: tool/train.sh: -u: not found
tool/train.sh: 19: tool/train.sh: -u: not found

Could you give me a hand?
I am using conda.

Thanks

AttributeError: module 'apex' has no attribute 'parallel'

    BatchNorm = apex.parallel.SyncBatchNorm
AttributeError: module 'apex' has no attribute 'parallel'

Here is the config detail:

TRAIN:
  arch: pspnet
  layers: 101
  sync_bn: True  # adopt syncbn or not
  train_h: 713
  train_w: 713
  scale_min: 0.5  # minimum random scale
  scale_max: 2.0  # maximum random scale
  rotate_min: -10  # minimum random rotate
  rotate_max: 10  # maximum random rotate
  zoom_factor: 8  # zoom factor for final prediction during training, be in [1, 2, 4, 8]
  ignore_label: 255
  aux_weight: 0.4
#  train_gpu: [0, 1, 2, 3, 4, 5, 6, 7]
  train_gpu: [2,3]
  workers: 12  # data loader workers
  batch_size: 4  # total batch size for training
  batch_size_val: 1  # batch size for validation during training, memory and speed tradeoff
  base_lr: 0.01
  epochs: 300
  start_epoch: 0
  power: 0.9
  momentum: 0.9
  weight_decay: 0.0001
  manual_seed: 520
  print_freq: 10
  save_freq: 1
  save_path: exp/cityscapes/pspnet101/model
  weight:  # path to initial weight (default: none)
  resume:  # path to latest checkpoint (default: none)
  evaluate: True  # evaluate on validation set, extra gpu memory needed and small batch_size_val is recommend
Distributed:
  distributed: True
  dist_url: tcp://127.0.0.1:6789
  dist_backend: 'nccl'
  multiprocessing_distributed: True
  world_size: 1
  rank: 0
  use_apex: True
  opt_level: 'O0'
  keep_batchnorm_fp32:
  loss_scale:

Can you help me with this apex problem?

Assertion `t >= 0 && t < n_classes` failed

Thanks for sharing your code! I was not able to run the training code for the Cityscapes dataset. Below you can see my configuration file and the error messages. This seems to be related to the labels being out of range. I looked at your loader (SemData): it reads the label files (in my case, color label files from the Cityscapes dataset) but does not convert them to the range [0, n_classes-1]. Could you have a look at it? Thanks a lot!

DATA:
  data_root: /content/CityScapes_modified
  train_list: dataset/cityscapes/fine_train.txt
  val_list: dataset/cityscapes/fine_val.txt
  classes: 19

TRAIN:
  arch: psp
  layers: 50
  sync_bn: True  # adopt syncbn or not
  train_h: 713
  train_w: 713
  scale_min: 0.5  # minimum random scale
  scale_max: 2.0  # maximum random scale
  rotate_min: -10  # minimum random rotate
  rotate_max: 10  # maximum random rotate
  zoom_factor: 8  # zoom factor for final prediction during training, be in [1, 2, 4, 8]
  ignore_label: 255
  aux_weight: 0.4
  train_gpu: [0]
  workers: 4  # data loader workers
  batch_size: 2  # batch size for training
  batch_size_val: 2  # batch size for validation during training, memory and speed tradeoff
  base_lr: 0.01
  epochs: 200
  start_epoch: 0
  power: 0.9
  momentum: 0.9
  weight_decay: 0.0001
  manual_seed:
  print_freq: 10
  save_freq: 1
  save_path: exp/cityscapes/pspnet50/model
  weight:  # path to initial weight (default: none)
  resume:  # path to latest checkpoint (default: none)
  evaluate: False  # evaluate on validation set, extra gpu memory needed and small batch_size_val is recommend
Distributed:
  dist_url: tcp://127.0.0.1:6789
  dist_backend: 'nccl'
  multiprocessing_distributed: False
  world_size: 1
  rank: 0
  use_apex: True
  opt_level: 'O0'
  keep_batchnorm_fp32:
  loss_scale:

TEST:
  test_list: dataset/cityscapes/fine_val.txt
  split: val  # split in [train, val and test]
  base_size: 2048  # based size for scaling
  test_h: 713
  test_w: 713
  scales: [1.0]  # evaluation scales, ms as [0.5, 0.75, 1.0, 1.25, 1.5, 1.75]
  has_prediction: False  # has prediction already or not
  index_start: 0  # evaluation start index in list
  index_step: 0  # evaluation step index in list, 0 means to end
  test_gpu: [0]
  model_path: exp/dataset/cityscapes/pspnet50/model/train_epoch_200.pth  # evaluation model path
  save_folder: exp/dataset/cityscapes/pspnet50/result/epoch_200/val/ss  # results save folder
  colors_path: dataset/cityscapes/cityscapes_colors.txt  # path of dataset colors
  names_path: dataset/cityscapes/cityscapes_names.txt  # path of dataset category names

#######################################################################
Error messages:
.
.
.
Totally 2975 samples in train set.
Starting Checking image&label pair train list...
Checking image&label pair train list done!
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:104: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [3,0,0], thread: [160,0,0] Assertion t >= 0 && t < n_classes failed.
(the same assertion message is repeated for many more threads)
THCudaCheck FAIL file=/pytorch/aten/src/THCUNN/generic/SpatialClassNLLCriterion.cu line=127 error=710 : device-side assert triggered
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/content/semseg-master/tool/train.py", line 426, in <module>
    main()
  File "/content/semseg-master/tool/train.py", line 107, in main
    main_worker(args.train_gpu, args.ngpus_per_node, args)
  File "/content/semseg-master/tool/train.py", line 236, in main_worker
    loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, optimizer, epoch)
  File "/content/semseg-master/tool/train.py", line 281, in train
    output, main_loss, aux_loss = model(input, target)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/semseg-master/model/pspnet.py", line 102, in forward
    main_loss = self.criterion(x, y)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/loss.py", line 916, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 2009, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 1840, in nll_loss
    ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: cuda runtime error (710) : device-side assert triggered at /pytorch/aten/src/THCUNN/generic/SpatialClassNLLCriterion.cu:127
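
A quick way to confirm the out-of-range-label hypothesis before training (a debugging sketch with a placeholder path, not a fix) is to check that every label value is either below the class count or equal to ignore_label:

import cv2
import numpy as np

classes, ignore_label = 19, 255
label = cv2.imread('path/to/label.png', cv2.IMREAD_GRAYSCALE)   # placeholder path
values = np.unique(label)
bad = values[(values >= classes) & (values != ignore_label)]
print('label values:', values)
print('out-of-range values (these trigger the CUDA assert):', bad)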

Training performance drop on cityscapes with default parameters.

Training log.
[2019-06-14 18:10:31,725 INFO train.py line 154 10613] arch: psp
aux_weight: 0.4
base_lr: 0.01
base_size: 2048
batch_size: 16
batch_size_val: 1
classes: 19
colors_path: dataset/cityscapes/cityscapes_colors.txt
data_root: datasets/cityscapes/
dist_backend: nccl
dist_url: tcp://127.0.0.1:6789
distributed: True
epochs: 200
evaluate: False
has_prediction: False
ignore_label: 255
index_split: 5
index_start: 0
index_step: 0
keep_batchnorm_fp32: None
layers: 50
loss_scale: None
manual_seed: None
model_path: exp/cityscapes/pspnet50/model/train_epoch_200.pth
momentum: 0.9
multiprocessing_distributed: True
names_path: dataset/cityscapes/cityscapes_names.txt
ngpus_per_node: 8
opt_level: O0
power: 0.9
print_freq: 10
rank: 0
resume: None
rotate_max: 10
rotate_min: -10
save_folder: exp/cityscapes/pspnet50/result/epoch_200/val/ss
save_freq: 1
save_path: exp/cityscapes/pspnet50/model
scale_max: 2.0
scale_min: 0.5
scales: [1.0]
split: val
start_epoch: 0
sync_bn: True
test_gpu: [0]
test_h: 713
test_list: semseg-master/dataset/cityscapes/fine_val.txt
test_w: 713
train_gpu: [0, 1, 2, 3, 4, 5, 6, 7]
train_h: 713
train_list: semseg-master/dataset/cityscapes/fine_train.txt
train_w: 713
use_apex: True
val_list: semseg-master/dataset/cityscapes/fine_val.txt
weight: None
weight_decay: 0.0001
workers: 16
world_size: 8
zoom_factor: 8

Test log
[2019-06-16 08:08:38,259 INFO test.py line 249 23114] Eval result: mIoU/mAcc/allAcc 0.7695/0.8400/0.9603.
[2019-06-16 08:08:38,259 INFO test.py line 251 23114] Class_0 result: iou/accuracy 0.9804/0.9881, name: road.
[2019-06-16 08:08:38,259 INFO test.py line 251 23114] Class_1 result: iou/accuracy 0.8454/0.9255, name: sidewalk.
[2019-06-16 08:08:38,259 INFO test.py line 251 23114] Class_2 result: iou/accuracy 0.9235/0.9677, name: building.
[2019-06-16 08:08:38,259 INFO test.py line 251 23114] Class_3 result: iou/accuracy 0.5557/0.6280, name: wall.
[2019-06-16 08:08:38,259 INFO test.py line 251 23114] Class_4 result: iou/accuracy 0.6037/0.6966, name: fence.
[2019-06-16 08:08:38,260 INFO test.py line 251 23114] Class_5 result: iou/accuracy 0.6419/0.7436, name: pole.
[2019-06-16 08:08:38,260 INFO test.py line 251 23114] Class_6 result: iou/accuracy 0.7032/0.8118, name: traffic light.
[2019-06-16 08:08:38,260 INFO test.py line 251 23114] Class_7 result: iou/accuracy 0.7856/0.8581, name: traffic sign.
[2019-06-16 08:08:38,260 INFO test.py line 251 23114] Class_8 result: iou/accuracy 0.9260/0.9675, name: vegetation.
[2019-06-16 08:08:38,260 INFO test.py line 251 23114] Class_9 result: iou/accuracy 0.6553/0.7457, name: terrain.
[2019-06-16 08:08:38,260 INFO test.py line 251 23114] Class_10 result: iou/accuracy 0.9460/0.9770, name: sky.
[2019-06-16 08:08:38,260 INFO test.py line 251 23114] Class_11 result: iou/accuracy 0.8237/0.9218, name: person.
[2019-06-16 08:08:38,260 INFO test.py line 251 23114] Class_12 result: iou/accuracy 0.6327/0.7595, name: rider.
[2019-06-16 08:08:38,260 INFO test.py line 251 23114] Class_13 result: iou/accuracy 0.9490/0.9782, name: car.
[2019-06-16 08:08:38,260 INFO test.py line 251 23114] Class_14 result: iou/accuracy 0.7279/0.7737, name: truck.
[2019-06-16 08:08:38,260 INFO test.py line 251 23114] Class_15 result: iou/accuracy 0.8610/0.9364, name: bus.
[2019-06-16 08:08:38,260 INFO test.py line 251 23114] Class_16 result: iou/accuracy 0.6388/0.6618, name: train.
[2019-06-16 08:08:38,260 INFO test.py line 251 23114] Class_17 result: iou/accuracy 0.6448/0.7306, name: motorcycle.
[2019-06-16 08:08:38,260 INFO test.py line 251 23114] Class_18 result: iou/accuracy 0.7760/0.8881, name: bicycle.

By the way, the training time is much longer than you report (I used 8 P40 GPUs).

I want to train pspnet50, but did not succeed.

Thanks for your code. When I run 'sh tool/train.sh ade20k pspnet50' to train pspnet50, it only displays test results. I don't know what to change; can you help me?
/semseg/config/ade20k/ade20k_pspnet101.yaml
DATA:
  data_root: /DATA/maran/anaconda3/Project/semseg/data/ADEChallengeData2016/images/data/
  train_list: /DATA/maran/anaconda3/Project/semseg/data/ADEChallengeData2016/training.txt
  val_list: /DATA/maran/anaconda3/Project/semseg/data/ADEChallengeData2016/validation.txt
  classes: 150

TRAIN:
  arch: psp
  layers: 50
  sync_bn: True  # adopt sync_bn or not
  train_h: 473
  train_w: 473
  scale_min: 0.5  # minimum random scale
  scale_max: 2.0  # maximum random scale
  rotate_min: -10  # minimum random rotate
  rotate_max: 10  # maximum random rotate
  zoom_factor: 8  # zoom factor for final prediction during training, be in [1, 2, 4, 8]
  ignore_label: 255
  aux_weight: 0.4
  train_gpu: [0, 1, 2, 3, 4, 5, 6, 7]
  workers: 16  # data loader workers
  batch_size: 16  # batch size for training
  batch_size_val: 8  # batch size for validation during training, memory and speed tradeoff
  base_lr: 0.01
  epochs: 200
  start_epoch: 101
  power: 0.9
  momentum: 0.9
  weight_decay: 0.0001
  manual_seed:
  print_freq: 10
  save_freq: 1
  save_path: exp/ade20k/pspnet50/model
  weight:  # path to initial weight (default: none)
  resume: exp/ade20k/pspnet50/model/train_epoch_100.pth  # path to latest checkpoint (default: none)
  evaluate: False  # evaluate on validation set, extra gpu memory needed and small batch_size_val is recommend
Distributed:
  dist_url: tcp://127.0.0.1:6789
  dist_backend: 'nccl'
  multiprocessing_distributed: True
  world_size: 1
  rank: 0
  use_apex: True
  opt_level: 'O0'
  keep_batchnorm_fp32:
  loss_scale:

TEST:
  test_list: /DATA/maran/anaconda3/Project/semseg/data/ADEChallengeData2016/validation.txt
  split: val  # split in [train, val and test]
  base_size: 512  # based size for scaling
  test_h: 473
  test_w: 473
  scales: [1.0]  # evaluation scales, ms as [0.5, 0.75, 1.0, 1.25, 1.5, 1.75]
  has_prediction: False  # has prediction already or not
  index_start: 0  # evaluation start index in list
  index_step: 0  # evaluation step index in list, 0 means to end
  test_gpu: [0]
  model_path: exp/ade20k/pspnet50/model/train_epoch_100.pth  # evaluation model path
  save_folder: exp/ade20k/pspnet50/result/epoch_100/val/ss  # results save folder
  colors_path: dataset/ade20k/ade20k_colors.txt  # path of dataset colors
  names_path: dataset/ade20k/ade20k_names.txt  # path of dataset category names

training on bdd100k

Hi,

Can you please share data-loading code and configs to train on bdd100k?

Thanks,

Resnet training procedure

@hszhao, did you train the ResNet models on the ImageNet dataset with dilated convolutions, or do you reuse the kernels of the original ResNet model as dilated kernels in PSPNet?
I have used your pretrained ResNet weights to initialise PSPNet and it achieves the published accuracy on the validation set. I trained a ResNet model with the changed architecture (three 3x3 kernels instead of one 7x7 kernel) but without dilated convolutions and used it to train PSPNet, and there is a lot of deviation in accuracy. Could you please let me know the training protocol of the ResNet models?

Question about 'manualSeed'

What does manualSeed mean in the following code from train.py?

def main():
    args = get_parser()
    check(args)
    os.environ["CUDA_VISIBLE_DEVICES"] = ','.join(str(x) for x in args.train_gpu)
    if args.manual_seed is not None:
        random.seed(args.manual_seed)
        np.random.seed(args.manual_seed)
        torch.manual_seed(manualSeed)
        torch.cuda.manual_seed(manualSeed)
        torch.cuda.manual_seed_all(manualSeed)
        cudnn.benchmark = False
        cudnn.deterministic = True
    if args.dist_url == "env://" and args.world_size == -1:
        args.world_size = int(os.environ["WORLD_SIZE"])
    args.distributed = args.world_size > 1 or args.multiprocessing_distributed
    args.ngpus_per_node = len(args.train_gpu)
    if len(args.train_gpu) == 1:
        args.sync_bn = False
        args.distributed = False
        args.multiprocessing_distributed = False
    if args.multiprocessing_distributed:
        args.world_size = args.ngpus_per_node * args.world_size
        mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args))
    else:
        main_worker(args.train_gpu, args.ngpus_per_node, args)

When I try to use the PSANet code, I get this error: subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1. Need help, thanks.

Traceback (most recent call last):
  File "/home/wuwenfu/.conda/envs/pytorch/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 949, in _build_extension_module
    check=True)
  File "/home/wuwenfu/.conda/envs/pytorch/lib/python3.7/subprocess.py", line 487, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 36, in <module>
    from models.psanet.psanet import PSANet
  File "/mnt/sda7/seg-master/seg-master/sun/models/psanet/psanet.py", line 5, in <module>
    import models.psanet.lib.psa.functional as PF
  File "/mnt/sda7/seg-master/seg-master/sun/models/psanet/lib/psa/functional.py", line 3, in <module>
    from . import functions
  File "/mnt/sda7/seg-master/seg-master/sun/models/psanet/lib/psa/functions/__init__.py", line 1, in <module>
    from .psamask import *
  File "/mnt/sda7/seg-master/seg-master/sun/models/psanet/lib/psa/functions/psamask.py", line 3, in <module>
    from .. import src
  File "/mnt/sda7/seg-master/seg-master/sun/models/psanet/lib/psa/src/__init__.py", line 18, in <module>
    ], build_directory=gpu_path, verbose=False)
  File "/home/wuwenfu/.conda/envs/pytorch/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 644, in load
    is_python_module)
  File "/home/wuwenfu/.conda/envs/pytorch/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 813, in _jit_compile
    with_cuda=with_cuda)
  File "/home/wuwenfu/.conda/envs/pytorch/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 866, in _write_ninja_file_and_build
    _build_extension_module(name, build_directory, verbose)
  File "/home/wuwenfu/.conda/envs/pytorch/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 962, in _build_extension_module
    raise RuntimeError(message)
RuntimeError: Error building extension 'psamask_gpu':
[1/3] /usr/local/cuda-10.0/bin/bin/nvcc -DTORCH_EXTENSION_NAME=psamask_gpu -DTORCH_API_INCLUDE_EXTENSION_H ... -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --compiler-options '-fPIC' -std=c++11 -c /mnt/sda7/seg-master/seg-master/sun/models/psanet/lib/psa/src/gpu/psamask_cuda.cu -o psamask_cuda.cuda.o
FAILED: psamask_cuda.cuda.o
/bin/sh: 1: /usr/local/cuda-10.0/bin/bin/nvcc: not found
[2/3] c++ -MMD -MF operator.o.d -DTORCH_EXTENSION_NAME=psamask_gpu -DTORCH_API_INCLUDE_EXTENSION_H ... -fPIC -std=c++11 -c /mnt/sda7/seg-master/seg-master/sun/models/psanet/lib/psa/src/gpu/operator.cpp -o operator.o
In file included from /mnt/sda7/seg-master/seg-master/sun/models/psanet/lib/psa/src/gpu/operator.h:1:0,
                 from /mnt/sda7/seg-master/seg-master/sun/models/psanet/lib/psa/src/gpu/operator.cpp:1:
/home/wuwenfu/.conda/envs/pytorch/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/torch.h:7:2: warning: #warning "Including torch/torch.h for C++ extensions is deprecated. Please include torch/extension.h" [-Wcpp]
ninja: build stopped: subcommand failed.
(the long -isystem include flags in the nvcc/c++ commands are abbreviated with "..." here)

I need help. Thanks.

About the args.batch_size_val

Hi, hs.
In your code,

if args.distributed:
    torch.cuda.set_device(gpu)
    args.batch_size = int(args.batch_size / ngpus_per_node)
    args.batch_size_val = int(args.batch_size_val / ngpus_per_node)
    args.workers = int(args.workers / ngpus_per_node)

I think the default batch_size_val should be at least equal to ngpus_per_node, otherwise you get this error:
ValueError: batch_size should be a positive integeral value, but got batch_size=0
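
A small hedged workaround (not necessarily the intended fix) is to clamp the per-process values to at least 1:

# Keep at least one sample per process even when the global batch size is
# smaller than ngpus_per_node (sketch of a possible guard, not the official fix).
args.batch_size = max(1, int(args.batch_size / ngpus_per_node))
args.batch_size_val = max(1, int(args.batch_size_val / ngpus_per_node))
args.workers = max(1, int(args.workers / ngpus_per_node))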

An assertion failure...

When I tried to train the network with 2 GPUs, I got this error:
/opt/conda/conda-bld/pytorch_1565272279342/work/aten/src/THCUNN/SpatialClassNLLCriterion.cu:103: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [0,0,0], thread: [363,0,0] Assertion t >= 0 && t < n_classes failed.
Is there any solution?
