Comments (18)

dvornikita commented on July 24, 2024

Check the last comments of #5. I guess your batch size is too small.

Engineering-Course commented on July 24, 2024

I set the batch size to 4 because of the limitation of GPU memory.
I agree that it is caused by data augmentation, when the random crop doesn't contain an object.
I found that the function tf.image.sample_distorted_bounding_box is used to generate distorted bounding boxes, with min_object_covered taking each value in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9] (sample_jaccards).
Is it necessary to include 0.0 in min_object_covered? Would removing 0.0 from the sample_jaccards array help with this error? (See the sketch after the config below.)
I think object-free crops would still occur, but with lower probability.

for iou in params['sample_jaccards']:
    sample_distorted_bounding_box = tf.image.sample_distorted_bounding_box(
        tf.shape(image),
        bounding_boxes=bboxes,
        min_object_covered=iou,
        aspect_ratio_range=[0.5, 2.0],
        area_range=[0.3, 1.0],
        max_attempts=params['crop_max_tries'],
        use_image_if_no_bounding_boxes=True)
    samplers.append(sample_distorted_bounding_box[:2])
    boxes.append(sample_distorted_bounding_box[2][0][0])

data_augmentation_config = {
    'X_out': 4,
    'brightness_prob': 0.5,
    'brightness_delta': 0.125,
    'contrast_prob': 0.5,
    'contrast_delta': 0.5,
    'hue_prob': 0.5,
    'hue_delta': 0.07,
    'saturation_prob': 0.5,
    'saturation_delta': 0.5,
    'sample_jaccards': [0.0, 0.1, 0.3, 0.5, 0.7, 0.9],
    'flip_prob': 0.5,
    'crop_max_tries': 50,
    'zoomout_color': [x / 255.0 for x in reversed(MEAN_COLOR)],
}
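If you want to try dropping the unconstrained sampler, the change discussed above is a one-line edit (a sketch of the idea, not something the authors evaluated):

    # Drop 0.0 so every accepted crop must cover at least 10% of some object.
    # As noted above, object-free crops become less likely, not impossible.
    data_augmentation_config['sample_jaccards'] = [0.1, 0.3, 0.5, 0.7, 0.9]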

dvornikita commented on July 24, 2024

For data augmentation, we followed the strategy of the SSD paper. We didn't evaluate constraining the sampling in this way; you can find some evaluations in the original paper.
What you can do is skip the forward pass when you have no positives. This would require minimal modifications to the code.
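A minimal sketch of that idea in TF 1.x graph mode, assuming detection_loss returns a scalar loss: tf.cond builds both branches but executes only the taken one at runtime, so the ops inside detection_loss (including the failing top_k) never run for object-free batches.

    # Sketch only; not the repo's actual wiring.
    number_of_positives = tf.reduce_sum(tf.to_int32(pos_mask))
    loss = tf.cond(number_of_positives > 0,
                   lambda: detection_loss(location, confidence,
                                          refine_ph, classes_ph, pos_mask),
                   lambda: tf.constant(0.0))  # no-op step when no positives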

Engineering-Course commented on July 24, 2024

I checked the code in training.py.
If number_of_positives is zero, then number_of_negatives becomes zero as well, which may cause an error in the tf.nn.top_k function.
Would it work to append a line that sets number_of_negatives to at least one?

def detection_loss(location, confidence, refine_ph, classes_ph, pos_mask):
    neg_mask = tf.logical_not(pos_mask)
    # If the crop contains no objects, pos_mask is all False and this is 0...
    number_of_positives = tf.reduce_sum(tf.to_int32(pos_mask))
    # ...which makes number_of_negatives 0 too (min(3 * 0, ...) == 0).
    number_of_negatives = tf.minimum(3 * number_of_positives,
                                     tf.shape(pos_mask)[1] - number_of_positives)
    normalizer = tf.to_float(tf.add(number_of_positives, number_of_negatives))
    tf.summary.scalar('batch/size', normalizer)
    num_pos_float = tf.to_float(number_of_positives)

    cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=confidence,
                                                                   labels=classes_ph)
    pos_class_loss = tf.reduce_sum(tf.boolean_mask(cross_entropy, pos_mask))
    tf.summary.scalar('loss/class_pos', pos_class_loss / num_pos_float)
    # Hard negative mining: with number_of_negatives == 0 this top_k call breaks.
    top_k_worst, top_k_inds = tf.nn.top_k(tf.boolean_mask(cross_entropy, neg_mask),
                                          number_of_negatives)
    neg_class_loss = tf.reduce_sum(top_k_worst)
    class_loss = (neg_class_loss + pos_class_loss) / num_pos_float
    tf.summary.scalar('loss/class_neg', neg_class_loss / tf.to_float(number_of_negatives))
    tf.summary.scalar('loss/class', class_loss)

dvornikita commented on July 24, 2024

Sorry, I didn't get your question.

fastlater commented on July 24, 2024

@Engineering-Course I commented on this issue and was waiting for my new GPU so I could continue testing with a higher batch size. In the meantime I tried with learning rate = 0 and batch_size = 1, and this error still comes up. @dvornikita Does that mean this error will appear whenever the batch size is not large enough? Could the code be changed a little to skip this error, as @Engineering-Course mentioned? I understand that batch size = 1 will normally hit this error, but I was thinking that batch size = 4 should at least run without it.

Engineering-Course commented on July 24, 2024

If the batch size is small, it is more likely that the randomly cropped batch doesn't contain any objects, which means there are no positive samples.
When that happens, number_of_positives is zero.

    number_of_positives = tf.reduce_sum(tf.to_int32(pos_mask))

Then number_of_negatives becomes zero too.

    number_of_negatives = tf.minimum(3 * number_of_positives,
                                     tf.shape(pos_mask)[1] - number_of_positives)

It will thus cause an error in the tf.nn.top_k function.

    top_k_worst, top_k_inds = tf.nn.top_k(tf.boolean_mask(cross_entropy, neg_mask),
                                          number_of_negatives)

So I recommend appending a line that sets number_of_negatives to at least one before calling the tf.nn.top_k function.

    number_of_negatives = tf.maximum(1, number_of_negatives)

This code is in the detection_loss function in training.py.
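In context, the proposed guard sits right before the top_k call, roughly like this (a sketch against the snippet above, not a committed fix):

    number_of_negatives = tf.minimum(3 * number_of_positives,
                                     tf.shape(pos_mask)[1] - number_of_positives)
    # Proposed guard: ensure tf.nn.top_k always receives k >= 1.
    number_of_negatives = tf.maximum(1, number_of_negatives)
    top_k_worst, top_k_inds = tf.nn.top_k(tf.boolean_mask(cross_entropy, neg_mask),
                                          number_of_negatives)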

fastlater commented on July 24, 2024

@Engineering-Course Let me know when you test it and whether you overcome the error. As I mentioned, I cannot do it myself right now because my GPU is not good enough; I cannot even set the batch size to 2.

Engineering-Course commented on July 24, 2024

You can try it even with a batch size of 1. It works for me.

dvornikita commented on July 24, 2024

@fastlater, the solution of @Engineering-Course should work fine. Just note that in this case you learn from only one negative example while normalizing the loss by one, which gives a loss of the same order of magnitude as usual, but the signal is not very desirable. This might bias the training, especially if this situation comes up often. So in addition to that, I would multiply the loss by zero when this occurs.
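A minimal sketch combining both suggestions, using the variable names from the detection_loss snippet above (illustrative, not necessarily the exact committed change):

    # Keep tf.nn.top_k valid even when there are no positives to mine against.
    number_of_negatives = tf.maximum(1, number_of_negatives)
    # 1.0 if the batch contains positives, else 0.0.
    gate = tf.to_float(number_of_positives > 0)
    # Zero the loss for object-free batches and guard the denominator, so the
    # single dummy negative neither biases training nor produces 0/0.
    class_loss = gate * (neg_class_loss + pos_class_loss) / tf.maximum(num_pos_float, 1.0)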

Engineering-Course commented on July 24, 2024

Yes, I agree with you.

fastlater commented on July 24, 2024

@dvornikita @Engineering-Course Thank you for the feedback. So, will I have to multiply the neg_class_loss by zero? Is that the loss you are talking about, or is it the class_loss? Let me know if you will add these lines to the code. I guess it would be good, at least for testing the training.

dvornikita commented on July 24, 2024

Pushed that modification. You can test it.

fastlater commented on July 24, 2024

@dvornikita @Engineering-Course I tried it, just for testing, with batch_size=1, and the error now is: Nan in summary histogram for: summarize_grads/ssd/confidence/ssd_back/block_rev2/weights_gradiant. Did this error come up when you tested the code? PS: I only modified training.py.

dvornikita commented on July 24, 2024

@fastlater I fixed this in the last commit. Apparently, the error was caused by bbox_loss, since it uses the smooth L1 loss, which also breaks when there are no positives. Now the training doesn't break, but I didn't manage to make it learn anything meaningful with a batch size of 1, which is not so surprising.
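For reference, a guarded smooth-L1 term could look like this sketch (illustrative only, not the actual commit; it assumes location and refine_ph have shape [batch, num_anchors, 4] and pos_mask has shape [batch, num_anchors]):

    # Smooth L1 (Huber) on the localization offsets.
    abs_diff = tf.abs(location - refine_ph)
    smooth_l1 = tf.where(abs_diff < 1.0, 0.5 * tf.square(abs_diff), abs_diff - 0.5)
    bbox_loss = tf.reduce_sum(tf.boolean_mask(smooth_l1, pos_mask))
    # Contributes exactly 0 (with a finite gradient) when there are no positives.
    bbox_loss = tf.to_float(number_of_positives > 0) * bbox_loss / tf.maximum(num_pos_float, 1.0)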

fastlater commented on July 24, 2024

@dvornikita As we all expected, it won't learn anything meaningful. However, it was good to fix it.

clxia12 commented on July 24, 2024

@Engineering-Course Where is training.py? I met the same error too, but I can't find this file in my folder. Can you tell me where it is in detail?

dvornikita commented on July 24, 2024

@clxia12 It's in the root folder of the project.
