Comments (18)

dvornikita commented on July 24, 2024

Check the last comments of #5. I guess your batch size is too small.

Engineering-Course commented on July 24, 2024

I set the batch size to 4 because of the limitation of GPU memory.
I agree that it is caused by data augmentation, when the random crop doesn't contain an object.
I found that the function tf.image.sample_distorted_bounding_box is used to generate distorted bounding boxes, with min_object_covered taking each value in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9] (sample_jaccards).
Is it necessary to include 0.0 in min_object_covered? Would removing 0.0 from the sample_jaccards array help with this error? (See the sketch after the config below.)
I think object-free crops would still occur, but with lower probability.

for iou in params['sample_jaccards']:
    sample_distorted_bounding_box = tf.image.sample_distorted_bounding_box(
        tf.shape(image),
        bounding_boxes=bboxes,
        min_object_covered=iou,
        aspect_ratio_range=[0.5, 2.0],
        area_range=[0.3, 1.0],
        max_attempts=params['crop_max_tries'],
        use_image_if_no_bounding_boxes=True)
    samplers.append(sample_distorted_bounding_box[:2])
    boxes.append(sample_distorted_bounding_box[2][0][0])

data_augmentation_config = {
    'X_out': 4,
    'brightness_prob': 0.5,
    'brightness_delta': 0.125,
    'contrast_prob': 0.5,
    'contrast_delta': 0.5,
    'hue_prob': 0.5,
    'hue_delta': 0.07,
    'saturation_prob': 0.5,
    'saturation_delta': 0.5,
    'sample_jaccards': [0.0, 0.1, 0.3, 0.5, 0.7, 0.9],
    'flip_prob': 0.5,
    'crop_max_tries': 50,
    'zoomout_color': [x / 255.0 for x in reversed(MEAN_COLOR)],
}
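If you want to try dropping the unconstrained sampler, the change discussed above is a one-line edit (a sketch of the idea, not something the authors evaluated):

    # Drop 0.0 so every accepted crop must cover at least 10% of some object.
    # As noted above, object-free crops become less likely, not impossible.
    data_augmentation_config['sample_jaccards'] = [0.1, 0.3, 0.5, 0.7, 0.9]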

dvornikita commented on July 24, 2024

For data augmentation, we followed the strategy of the SSD paper. We didn't evaluate constraining the sampling in this way; you can find some evaluations in the original paper.
What you can do is skip the forward pass when you have no positives. This would require minimal modifications to the code.
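A minimal sketch of that idea in TF 1.x graph mode, assuming detection_loss returns a scalar loss: tf.cond builds both branches but executes only the taken one at runtime, so the ops inside detection_loss (including the failing top_k) never run for object-free batches.

    # Sketch only; not the repo's actual wiring.
    number_of_positives = tf.reduce_sum(tf.to_int32(pos_mask))
    loss = tf.cond(number_of_positives > 0,
                   lambda: detection_loss(location, confidence,
                                          refine_ph, classes_ph, pos_mask),
                   lambda: tf.constant(0.0))  # no-op step when no positives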

Engineering-Course commented on July 24, 2024

I checked the code in training.py.
If number_of_positives is zero, then number_of_negatives becomes zero as well, which may cause an error in the tf.nn.top_k function.
Would it work to append a line that sets number_of_negatives to at least one?

def detection_loss(location, confidence, refine_ph, classes_ph, pos_mask):
    neg_mask = tf.logical_not(pos_mask)
    # If the crop contains no objects, pos_mask is all False and this is 0...
    number_of_positives = tf.reduce_sum(tf.to_int32(pos_mask))
    # ...which makes number_of_negatives 0 too (min(3 * 0, ...) == 0).
    number_of_negatives = tf.minimum(3 * number_of_positives,
                                     tf.shape(pos_mask)[1] - number_of_positives)
    normalizer = tf.to_float(tf.add(number_of_positives, number_of_negatives))
    tf.summary.scalar('batch/size', normalizer)
    num_pos_float = tf.to_float(number_of_positives)

    cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=confidence,
                                                                   labels=classes_ph)
    pos_class_loss = tf.reduce_sum(tf.boolean_mask(cross_entropy, pos_mask))
    tf.summary.scalar('loss/class_pos', pos_class_loss / num_pos_float)
    # Hard negative mining: with number_of_negatives == 0 this top_k call breaks.
    top_k_worst, top_k_inds = tf.nn.top_k(tf.boolean_mask(cross_entropy, neg_mask),
                                          number_of_negatives)
    neg_class_loss = tf.reduce_sum(top_k_worst)
    class_loss = (neg_class_loss + pos_class_loss) / num_pos_float
    tf.summary.scalar('loss/class_neg', neg_class_loss / tf.to_float(number_of_negatives))
    tf.summary.scalar('loss/class', class_loss)

dvornikita commented on July 24, 2024

Sorry, I didn't get your question.

fastlater commented on July 24, 2024

@Engineering-Course I commented on this issue and was waiting for my new GPU so I could continue testing with a higher batch size. In the meantime I tried with learning rate = 0 and batch_size = 1, and this error still comes up. @dvornikita Does that mean this error will appear whenever the batch size is not large enough? Could the code be changed a little to skip this error, as @Engineering-Course mentioned? I understand that batch size = 1 will normally hit this error, but I was thinking that batch size = 4 should at least run without it.

Engineering-Course commented on July 24, 2024

If the batch size is small, it is more likely that the randomly cropped batch doesn't contain any objects, which means there are no positive samples.
When that happens, number_of_positives is zero.

    number_of_positives = tf.reduce_sum(tf.to_int32(pos_mask))

Then number_of_negatives becomes zero too.

    number_of_negatives = tf.minimum(3 * number_of_positives,
                                     tf.shape(pos_mask)[1] - number_of_positives)

It will thus cause an error in the tf.nn.top_k function.

    top_k_worst, top_k_inds = tf.nn.top_k(tf.boolean_mask(cross_entropy, neg_mask),
                                          number_of_negatives)

So I recommend appending a line that sets number_of_negatives to at least one before calling the tf.nn.top_k function.

    number_of_negatives = tf.maximum(1, number_of_negatives)

This code is in the detection_loss function in training.py.
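In context, the proposed guard sits right before the top_k call, roughly like this (a sketch against the snippet above, not a committed fix):

    number_of_negatives = tf.minimum(3 * number_of_positives,
                                     tf.shape(pos_mask)[1] - number_of_positives)
    # Proposed guard: ensure tf.nn.top_k always receives k >= 1.
    number_of_negatives = tf.maximum(1, number_of_negatives)
    top_k_worst, top_k_inds = tf.nn.top_k(tf.boolean_mask(cross_entropy, neg_mask),
                                          number_of_negatives)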

fastlater commented on July 24, 2024

@Engineering-Course Let me know when you test it and whether you overcome the error. As I mentioned, I cannot do it myself right now because my GPU is not good enough; I cannot even set the batch size to 2.

Engineering-Course commented on July 24, 2024

You can try it even with a batch size of 1. It works for me.

dvornikita commented on July 24, 2024

@fastlater, the solution of @Engineering-Course should work fine. Just note that in this case you learn from only one negative example while normalizing the loss by one, which gives a loss of the same order of magnitude as usual, but the signal is not very desirable. This might bias the training, especially if this situation comes up often. So in addition to that, I would multiply the loss by zero when this occurs.
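A minimal sketch combining both suggestions, using the variable names from the detection_loss snippet above (illustrative, not necessarily the exact committed change):

    # Keep tf.nn.top_k valid even when there are no positives to mine against.
    number_of_negatives = tf.maximum(1, number_of_negatives)
    # 1.0 if the batch contains positives, else 0.0.
    gate = tf.to_float(number_of_positives > 0)
    # Zero the loss for object-free batches and guard the denominator, so the
    # single dummy negative neither biases training nor produces 0/0.
    class_loss = gate * (neg_class_loss + pos_class_loss) / tf.maximum(num_pos_float, 1.0)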

Engineering-Course commented on July 24, 2024

Yes, I agree with you.

fastlater commented on July 24, 2024

@dvornikita @Engineering-Course Thank you for the feedback. So, will I have to multiply the neg_class_loss by zero? Is that the loss you are talking about, or is it the class_loss? Let me know if you will add these lines to the code. I guess it would be good, at least for testing the training.

dvornikita commented on July 24, 2024

Pushed that modification. You can test it.

fastlater commented on July 24, 2024

@dvornikita @Engineering-Course I tried it, just for testing, with batch_size=1, and the error now is: Nan in summary histogram for: summarize_grads/ssd/confidence/ssd_back/block_rev2/weights_gradiant. Did this error come up when you tested the code? PS: I only modified training.py.

dvornikita commented on July 24, 2024

@fastlater I fixed this in the last commit. Apparently, the error was caused by bbox_loss, since it uses the smooth L1 loss, which also breaks when there are no positives. Now the training doesn't break, but I didn't manage to make it learn anything meaningful with a batch size of 1, which is not so surprising.
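For reference, a guarded smooth-L1 term could look like this sketch (illustrative only, not the actual commit; it assumes location and refine_ph have shape [batch, num_anchors, 4] and pos_mask has shape [batch, num_anchors]):

    # Smooth L1 (Huber) on the localization offsets.
    abs_diff = tf.abs(location - refine_ph)
    smooth_l1 = tf.where(abs_diff < 1.0, 0.5 * tf.square(abs_diff), abs_diff - 0.5)
    bbox_loss = tf.reduce_sum(tf.boolean_mask(smooth_l1, pos_mask))
    # Contributes exactly 0 (with a finite gradient) when there are no positives.
    bbox_loss = tf.to_float(number_of_positives > 0) * bbox_loss / tf.maximum(num_pos_float, 1.0)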

fastlater commented on July 24, 2024

@dvornikita As we all expected, it won't learn anything meaningful. However, it was good to fix it.

clxia12 commented on July 24, 2024

@Engineering-Course Where is training.py? I met the same error too, but I can't find this file in my folder. Can you tell me where it is in detail?

dvornikita commented on July 24, 2024

@clxia12 It's in the root folder of the project.
