
Comments (54)

Ricardozzf commented on May 22, 2024

@lianuo
[image: over-detected boxes for the person class]

Same issue. I changed cls to 2 (background and person); it looks like the class confidence has something wrong in training.

xyutao commented on May 22, 2024

No warm-up process found for SGD. According to the YOLO9000 paper and the official code, we need to warm up the first 1000 iterations to get better convergence:
warmup_lr = lr * batch_size / burn_in, where lr = 1e-3, batch_size = 64 and burn_in = 1000
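
A minimal sketch of such a warm-up schedule (my illustration, using the power-4 ramp that darknet's burn_in policy applies; not code from either repo):

def warmup_lr(batch_idx, lr=1e-3, burn_in=1000, power=4):
    # Ramp the learning rate from 0 up to `lr` over the first `burn_in` batches.
    if batch_idx >= burn_in:
        return lr
    return lr * (batch_idx / burn_in) ** power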

glenn-jocher commented on May 22, 2024

@nirbenz @okanlv @deeppower @xiao1228 Good news, I think. I thought about the problem a bit and decided that the loss terms needed rebalancing. In my last plot you can see Classification is consuming the great majority of the loss, which means it is being optimized at the expense of all the other losses. Ideally the 6 losses would be roughly equal in magnitude so that they are all optimized with equal priority.

So I made a commit that multiplied Objectness loss by 10, and divided Classification loss by 10:

yolov3/models.py

Lines 166 to 176 in e04bb75

if nM > 0:
    lx = k * MSELoss(x[mask], tx[mask])
    ly = k * MSELoss(y[mask], ty[mask])
    lw = k * MSELoss(w[mask], tw[mask])
    lh = k * MSELoss(h[mask], th[mask])
    # lconf = k * BCEWithLogitsLoss(pred_conf[mask], mask[mask].float())
    lconf = (k * 10) * BCEWithLogitsLoss(pred_conf, mask.float())
    lcls = (k / 10) * CrossEntropyLoss(pred_cls[mask], torch.argmax(tcls, 1))
    # lcls = k * BCEWithLogitsLoss(pred_cls[mask], tcls.float())

I ran this for most of the day on GCP, and after about 10 epochs I overlaid the 3 different trainings I'd done. This new approach seems vastly better, in particular at increasing Recall compared to before. I thought this was exciting enough to post the news right away; I'll have to train for another week to get to 70+ epochs and see the true effect. I'm wondering if there isn't a better way to more automatically balance these 6 equally important loss terms. They seem roughly equal now after 10 epochs, but maybe there's a way to update the balancing terms every epoch using the previous epoch's gains. Any ideas?
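
One possible direction, sketched below: keep a running mean of each loss term and divide each term by it, so all six stay near unit magnitude. This is only an illustration of the idea, not code from the repo.

import torch

class LossBalancer:
    """Rescale loss terms by their running means so they stay comparable."""
    def __init__(self, n_terms=6, momentum=0.9):
        self.means = torch.ones(n_terms)
        self.momentum = momentum

    def __call__(self, losses):
        # losses: sequence of scalar tensors, e.g. [lx, ly, lw, lh, lconf, lcls]
        with torch.no_grad():
            current = torch.tensor([float(l) for l in losses])
            self.means = self.momentum * self.means + (1 - self.momentum) * current
        # Divide each term by its running mean so all terms are ~1 in magnitude.
        return sum(l / m.clamp(min=1e-8) for l, m in zip(losses, self.means))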

UPDATE 1: mAP is 0.43 (-conf_thresh 0.20) at epoch 20. Updated plots below (green).
UPDATE 2: mAP is 0.46 (-conf_thresh 0.20) at epoch 35. Updated plots below (green).
UPDATE 3: mAP is 0.46 (-conf_thresh 0.30) at epoch 49 :( Jumps in loss were observed during training, possibly due to many restarts of the preemptible GCP VM. New commit 45c5567 runs test.py after each training epoch and records training mAP to results.txt. Starting new training from scratch using PyTorch 1.0 on GCP. Will post a new comment when new results start coming in.
[figure: overlaid training plots for the 3 runs]

glenn-jocher commented on May 22, 2024

@okanlv GOOD NEWS! I tested the first 10 epochs with randomly-initialized vs darknet53.conv.74-initialized weights, and the darknet53.conv.74 initialization produces much better results. I will continue training the darknet53.conv.74-initialized version up to epoch 68 over the coming week to see how it does. The latest commit automatically initializes yolov3 with darknet53.conv.74 when training from scratch.

[figure: first 10 epochs, random vs darknet53.conv.74 initialization]

@nirbenz yes, issue #9 had someone run the official COCO mAP code on this repo, but I was not able to get a pull request from him to update the repo. #9 (comment)

glenn-jocher commented on May 22, 2024

@xyutao I've switched from Adam to SGD with burn-in (which exponentially ramps up the learning rate from 0 to 0.001 over the first 1000 iterations) in commit a722601:

yolov3/train.py

Lines 115 to 120 in a722601

# SGD burn-in
if (epoch == 0) & (i <= 1000):
    power = 4
    lr = 1e-3 * (i / 1000) ** power
    for g in optimizer.param_groups:
        g['lr'] = lr

Unfortunately this caused the width and height loss terms to diverge when training from scratch. I saw that these are the only unbounded outputs of the network (all the rest are sigmoided), so I was forced to sigmoid them as well and create new width and height calculations, after which the training converged. The original and updated calculations from this commit are:

yolov3/models.py

Lines 121 to 131 in a722601

# Width and height (yolo method)
# w = p[..., 2] # Width
# h = p[..., 3] # Height
# width = torch.exp(w.data) * self.anchor_w
# height = torch.exp(h.data) * self.anchor_h
# Width and height (power method)
w = torch.sigmoid(p[..., 2]) # Width
h = torch.sigmoid(p[..., 3]) # Height
width = ((w.data * 2) ** 2) * self.anchor_w
height = ((h.data * 2) ** 2) * self.anchor_h

If I plot both of these in MATLAB, it looks like the lack of a ceiling in the original code is causing the divergence problem. It may be that the original width/height equations are incorrect. Does anyone know where to find the original darknet width and height calculations?

>> x=linspace(-3,3);
>> y1 = exp(x);
>> y2 = ((logsig(x) * 2).^2);
>> fig; plot(x,y1,'.-'); plot(x,y2,'.-'); h=gca; h.YLim=[0,5]; legend('original','updated'); xyzlabel('network output','anchor width multiple'); fcnfontsize(14)
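
For those without MATLAB, an equivalent comparison in Python (a sketch using numpy/matplotlib, replacing the personal plotting helpers above):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3)
y1 = np.exp(x)                    # original yolo method, unbounded above
y2 = (2 / (1 + np.exp(-x))) ** 2  # power method: (2 * sigmoid(x)) ** 2, capped at 4

plt.plot(x, y1, '.-', label='original')
plt.plot(x, y2, '.-', label='updated')
plt.ylim(0, 5)
plt.xlabel('network output')
plt.ylabel('anchor width multiple')
plt.legend()
plt.show()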

glenn-jocher commented on May 22, 2024

@lianuo @Ricardozzf @xyutao @CF2220160244 @jaelim I have good news. A significant bug in the loss function was found today in issue #12, namely a problem size_average-ing the various loss terms. This caused the lconf_obj term to be 80 times too large (80 = COCO class count), making the network over-detect objects, which I believe was the major problem many of you saw in your training.

I fixed this in commit cf9b4cf, and after the change observed that SGD with burn-in now converges with the original YOLO width/height calculations, so I restored those in commit 5d402ad.

Update: Sorry guys, I think I might have spoken too soon. The changes help, but resuming training from yolov3.pt still causes P and R to drop from initially high values to lower values after ~50 batches. I think we are getting closer to the source of the problem however, which I feel is in the model loss term somewhere. TODO: I also need to ignore non-best anchors with IoU > 0.50 to match yolov3.
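
For that TODO, something like the sketch below could build the mask (my illustration only; the tensor layout is assumed, not the repo's code):

import torch

def objectness_mask(anchor_ious, best_anchor_idx, thresh=0.5):
    """anchor_ious: (nA, nT) IoU of each anchor box with each target;
    best_anchor_idx: (nT,) index of the best-matching anchor per target.
    Returns an (nA,) bool mask of anchors to keep in the objectness loss."""
    ignore = anchor_ious > thresh                                        # overlapping anchors...
    ignore[best_anchor_idx, torch.arange(len(best_anchor_idx))] = False  # ...except best matches
    return ~ignore.any(dim=1)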

libzzluo commented on May 22, 2024

@TreB1eN
Sure. The final mAP (416*416, 55.0 mAP, 160 epochs) with weights trained from scratch is a bit lower than yolov3.weights (416*416, 55.3). That's my training result.

xiao1228 commented on May 22, 2024

@libzzluo hi, did you train from scratch to achieve that result? What did you change? I am getting weird results, see #22.
Thank you in advance!

glenn-jocher commented on May 22, 2024

@nirbenz @okanlv @deeppower @xiao1228 I've started running studies to improve the COCO mAP when training from darknet53.conv.74. I started with the #2 (comment) model that gets 0.46 mAP at epoch 35. The primary breakthrough there was simply rebalancing the loss terms, multiplying lconf by 10 (lconf = (k * 10) * ...) and dividing lcls by 10 (lcls = (k / 10) * ...) to get that 0.46 mAP.

All tests below are only run for the first epoch. Freezing the darknet53 layers (just for the first epoch) showed slightly positive results. It seems further rebalancing the loss terms has the biggest effect. In most ML regression problems the inputs and targets are recalibrated to zero mean and unity variance; yolov3 does this for the inputs via batch_norm layers but not for the regression targets (the bounding boxes), so I want to try this (regression problems that fail to do this perform far worse). See the sketch after the table below.

Any other experiments you guys want let me know. I'll keep populating this comment as my results come in over the next week.

Configuration                      mAP (epoch 0)  Precision  Recall
default #2 (comment)               0.168          0.200      0.175
... + weight_decay=0               0.169          0.200      0.176
... + darknet53 frozen             0.172          0.210      0.179
... + lconf*16                     0.181          0.214      0.188
... + lcls/4                       0.231          0.268      0.243
... + dkn53 unfrozen + lconf*32    0.237          0.263      0.250
... + lconf*64                     0.225          0.249      0.235
... + bbox targets normalization   (pending)
... + additional experiments?      (pending)
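
For the "bbox targets normalization" row, the idea would be something like the sketch below (an illustration of the standard recipe, not repo code): standardize the regression targets with statistics from the training set, then invert the transform at inference.

import torch

def fit_stats(targets):
    # targets: (N, 4) tensor of x, y, w, h regression targets from the training set
    return targets.mean(dim=0), targets.std(dim=0).clamp(min=1e-8)

def normalize(t, mean, std):
    return (t - mean) / std   # zero mean, unity variance

def denormalize(p, mean, std):
    return p * std + mean     # undo the transform at inference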

This is my selected configuration, + lconf*64 in the above table, and in the latest commit. darknet53 is not frozen in the first epoch, as I found this hurts later epochs. I'm now training to about 50 epochs.
mAP is 0.45 (-conf_thresh 0.30) at epoch 12
mAP is 0.48 (-conf_thresh 0.30) at epoch 17
mAP is 0.50 (-conf_thresh 0.30) at epoch 45 (jumps in losses, not sure why again)
mAP is 0.522 (-conf_thresh 0.30 at img_size 416) at epoch 62 (max mAP achieved)
[figure: COCO training loss curves]

lianuo commented on May 22, 2024

The loss is going down, so I wonder whether the definition of the loss leads to this problem?

glenn-jocher commented on May 22, 2024

@lianuo @Ricardozzf I've not tried to continue training from the official yolov3 weights. It probably won't pick up smoothly where Joseph Redmon and company left off for a number of reasons, such as the optimizer starting with no knowledge of the previous optimizer's momentum and LR. There are also a few primary differences between my training and the official darknet training:

  • Issue #4: train.py uses the Adam optimizer in place of SGD. I could not get SGD to converge with the yolov3 learning rate.
  • Non-Maximum Suppression (NMS) is not applied during training, so precision may appear artificially low while training, as many of the False Positives (FPs) in the denominator of P = TP / (TP + FP) are eliminated during testing but not during training.
  • Issue #3: I use CrossEntropyLoss in place of BinaryCrossEntropyLoss for classification loss during training. I made this change after observing better performance with CE vs BCE (I don't understand the reason for this, as darknet uses BCE). These two loss terms are on lines 162 and 163 of models.py. Note that the BCEWithLogitsLoss I use produces the same loss as BinaryCrossEntropyLoss + torch.sigmoid() on the first term, but BCEWithLogitsLoss is preferable for numerical stability reasons. If you want to continue training from yolov3.weights, you need to use BinaryCrossEntropy or BCEWithLogitsLoss as in the commented line below (a quick equivalence check follows the snippet).
lcls = nM * CrossEntropyLoss(pred_cls[mask], torch.argmax(tcls, 1))
# lcls = nM * BCEWithLogitsLoss2(pred_cls[mask], tcls.float())
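
The stability/equivalence claim is easy to verify (a quick sketch, not repo code):

import torch
import torch.nn as nn

logits = torch.randn(8, 80)
targets = torch.rand(8, 80)

a = nn.BCEWithLogitsLoss()(logits, targets)
b = nn.BCELoss()(torch.sigmoid(logits), targets)
print(torch.allclose(a, b, atol=1e-6))  # True; the logits version is numerically safer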

@lianuo how many epochs did you train this way? If you make the switch to BCE, does this help?

@Ricardozzf your results don't look good. Are you training from scratch or resuming training from the yolov3 weights like @lianuo? If you suspect the class confidence has a problem, it must be because I've swapped in CE for BCE. You can switch BCE back on via the commented lines above. Also note that if you are training from scratch you need a significant number of epochs before things start looking good. In my training I see about 0.50 mAP on the COCO 2014 validation set after 40 epochs (3 days of training on a 1080 Ti).

lianuo commented on May 22, 2024

@glenn-jocher Thank you for the reply!
I just tried resuming training from the official yolov3 weights with
optimizer = torch.optim.SGD(model.parameters(), lr=.0001, momentum=.9, weight_decay=5e-4, nesterov=True)
and switched to BCEWithLogitsLoss.
The precision drops to 0.18 and recall grows to 0.6, just like with the previous settings.

It is strange that when I run test.py with this trained weight, I still get a high score; see the screenshot:
[screenshot: test.py output showing high scores]
But when I run detect.py with this trained weight, the result is still not good, like this:
[image: poor detect.py result on COCO image 000000000019]

Is this because of the method used to evaluate mAP?

lianuo commented on May 22, 2024

@glenn-jocher have you used the weights you trained (0.50 mAP on COCO 2014) to test an image?
Could you share the weights or the test results on images?
It is a little strange that the score is high while the image test results are not good...
Thank you so much for the reply.

lianuo commented on May 22, 2024

@Ricardozzf thanks for your information. I am not alone, haha.

Ricardozzf commented on May 22, 2024

@glenn-jocher thanks for your reply
I have trained the model from scratch for 14 epochs on a TITAN X. In order to make full use of the GPU, I changed batch_size from 12 to 16; the other config is default.
In training, the model looks good:
[screenshot: training metrics]
In testing I use the CrowdHuman dataset, and the score is high:
[screenshot: high test scores on CrowdHuman]
Although the scores in training and testing are high, the result processed by detect.py is bad. Maybe one thing can be confirmed: the testing score doesn't match the results of detect.py.

I hope the information is useful to us.

glenn-jocher commented on May 22, 2024

@lianuo @Ricardozzf that's a good question, I will compare my test.py and detect.py results. I am at epoch 37 training on COCO 2014. If I run test.py I see this:

+ Sample [4998/5024] AP: 0.7528 (0.4926)
+ Sample [4999/5024] AP: 0.8333 (0.4927)
+ Sample [5000/5024] AP: 0.5543 (0.4927)
Mean Average Precision: 0.4927

If I then use the epoch 37 checkpoint latest.pt with detect.py I see this on my example image, which is the same problem you guys are seeing.
[image: detect.py result on the repo example image at epoch 37]

I'm wondering if I caused this by switching from BCE to CE. In xView when I used this code I had to increase my -conf_thresh in detect.py to ~0.99 to reduce FP. If I increase -conf_thresh to 0.99 now (and change -nms_thresh to 0.45 to match test.py) then I get this. Better, but still not quite right.

[image: detect.py result with -conf_thresh 0.99]

This is a bit of an apples-and-oranges comparison though. The official weights are at 160 epochs and my latest.pt is only at 37 epochs, so it's possible that training up to 160 will resolve this problem.

I don't understand why test.py is producing such a high mAP though, especially since it uses a very low -conf_thresh of 0.5. You guys are right, there is an unresolved issue somewhere. I will try and investigate more. The problem seems twofold:

  1. Issue #5: test.py is possibly over-reporting mAP on trained checkpoints, even though it correctly reports mAP on the official YOLOv3 weights, an odd inconsistency. This seems to be the easiest issue to resolve, so I'll look at this first.
  2. Trained weights seem to require much higher confidence thresholds (~0.99) than typically used in YOLOv3 (~0.8 commonly). This would seem to be unrelated to the CE vs BCE issue, as @lianuo trained from epoch 160 using BCE and still saw poor results.

Any ideas are appreciated as well!

glenn-jocher commented on May 22, 2024

@lianuo @Ricardozzf the overly-high mAPs you were seeing before should be partly fixed in the latest commits, which fixed the mAP calculations (see issue #7). The official weights now produce 0.57 mAP, but the trained weights that before gave me 0.50 mAP now return about 0.13 mAP, much more in line with the poor boxes you see in your images.

I still don't understand the actual cause of the poor training results however.

lianuo commented on May 22, 2024

@glenn-jocher Thank you for the reply~

lianuo commented on May 22, 2024

@glenn-jocher the loss still decreases during training; do you think the loss function needs to be modified?

xyutao commented on May 22, 2024

@glenn-jocher The usage of CrossEntropyLoss might be incorrect. The input shape is (nB, nA, nG, nG, nC), but the pytorch docs suggest it should be (nB, nC, ...). See:

p = p.view(bs, self.nA, self.bbox_attrs, nG, nG).permute(0, 1, 3, 4, 2).contiguous()  # prediction

Besides, torch.argmax(tcls, 1) fetches C from dim=1, but the shape of tcls is actually (nB, nA, nG, nG, nC). Maybe we need to permute the dims so that C is at dim=1.
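
For reference, a self-contained sketch of the permute-based alternative (my illustration; the dimensions follow the shapes above, and the random tensors are stand-ins):

import torch
import torch.nn as nn

nB, nA, nG, nC = 12, 3, 13, 80
pred_cls = torch.randn(nB, nA, nG, nG, nC)
tcls = torch.zeros(nB, nA, nG, nG, nC)
tcls.scatter_(-1, torch.randint(nC, (nB, nA, nG, nG, 1)), 1.0)  # one-hot targets

# Move the class dim to position 1, as nn.CrossEntropyLoss expects (nB, nC, ...):
loss = nn.CrossEntropyLoss()(pred_cls.permute(0, 4, 1, 2, 3),
                             tcls.permute(0, 4, 1, 2, 3).argmax(1))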

glenn-jocher commented on May 22, 2024

@xyutao I looked into the CELoss usage; I think this part is ok. When I start training and debug at this spot, the dimensions look good (assuming nC = 80 and 47 targets here in the first batch of nB = 12 images). mask eliminates all the other dimensions:

tcls.shape
Out[2]: torch.Size([12, 3, 13, 13, 80])

tcls = tcls[mask]
tcls.shape
Out[3]: torch.Size([47, 80])

lcls = nM * CrossEntropyLoss(pred_cls[mask], torch.argmax(tcls, 1))
Out[4]: tensor(206.37325, grad_fn=<MulBackward1>)

pred_cls[mask].shape
Out[5]: torch.Size([47, 80])

torch.argmax(tcls, 1).shape
Out[6]: torch.Size([47])

I did link to your comment on the SGD warmup, however; this is a good catch! Issue #4 is open on this. By the first 1000 iterations, do you mean the first 1000 batches?

xyutao commented on May 22, 2024

@glenn-jocher Yeah, the first 1000 batches of batch_size=64.

CF2220160244 commented on May 22, 2024

Please help, I have the same error.
Did you guys solve this problem? Thanks!

jaelim commented on May 22, 2024

@lianuo Hi, just wondering how you loaded pre-trained weights. Did you add this line of code in train.py?

    # Initialize model 
    model = Darknet(opt.cfg, opt.img_size) 
    model.load_weights(opt.weights_path)

jaelim commented on May 22, 2024

@lianuo I found out from detect.py that you add this line:

load_weights(model, weights_path)

But now I'm getting a different error from datasets.py:
[screenshots: error traceback from datasets.py]

Have you encountered this problem, and if so, how did you deal with it?

glenn-jocher commented on May 22, 2024

@jaelim you resume training from a trained model (i.e. latest.pt) by setting opt.resume = True:

yolov3/train.py

Lines 50 to 53 in 68de92f

if opt.resume:
    checkpoint = torch.load('checkpoints/latest.pt', map_location='cpu')
    model.load_state_dict(checkpoint['model'])

If you are seeing the error you mentioned, it is because you failed to define a proper path to an image or image folder on detect.py line 14 (no images are loaded). Make sure there are only image files in the path if you specify a path. Also, please do not ask questions unrelated to the main issue title in this thread.

parser.add_argument('-image_folder', type=str, default='data/samples', help='path to images')
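
And on resuming: assuming the checkpoint also stores the optimizer state (a later comment here notes both states are saved), a fuller resume would look something like this sketch (field names assumed; model and optimizer are defined earlier in train.py):

checkpoint = torch.load('checkpoints/latest.pt', map_location='cpu')
model.load_state_dict(checkpoint['model'])
if checkpoint.get('optimizer') is not None:
    optimizer.load_state_dict(checkpoint['optimizer'])
start_epoch = checkpoint.get('epoch', -1) + 1  # continue from the next epoch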

lianuo commented on May 22, 2024

@glenn-jocher great work, you are approaching the truth~
Recently I tested Andy's solution; it can resume training from the original weights while maintaining the high P and Recall. Maybe you can find something useful in his code: https://github.com/andy-yun/pytorch-0.4-yolov3

glenn-jocher commented on May 22, 2024

@lianuo yes this is the ultimate test. If the repo loss terms are perfectly aligned to darknet then the P and R terms should not degrade once you continue training from the official weights. More work to do, but I think it's getting closer.

glenn-jocher commented on May 22, 2024

@lianuo Andy-yun has a very different loss function. It would be easy to implement, but I don't understand several parts of it, which seem incorrect. I've raised an issue on his repo to get some answers, such as the 3 questions below. By the way, even if the loss were perfectly equal to darknet's, if the learning rate is not perfectly aligned from darknet epoch 160, P and R will start to drop, so we need to know exactly the darknet learning rate at epoch 160 (UPDATE: see issue #18; it should be impossible to resume training with no P and R loss, as the final darknet lr = 0).
andy-yun/pytorch-0.4-yolov3#22

  1. I think you should use BCELoss for loss_cls, as the YOLOv3 paper section 2.2 clearly states "During training we use binary cross-entropy loss for the class predictions."

  2. Why is MSELoss used in place of BCELoss for loss_conf? Did you make this choice yourself or did you see this in darknet?

  3. Why divide loss_coord by 2?

https://github.com/andy-yun/pytorch-0.4-yolov3/blob/master/yolo_layer.py#L161-L164

loss_coord = nn.MSELoss(size_average=False)(coord*coord_mask, tcoord*coord_mask)/2
loss_conf = nn.MSELoss(size_average=False)(conf*conf_mask, tconf*conf_mask)
loss_cls = nn.CrossEntropyLoss(size_average=False)(cls, tcls) if cls.size(0) > 0 else 0
loss = loss_coord + loss_conf + loss_cls

TreB1eN commented on May 22, 2024

So the training is still not figured out?
By the way, is it really possible to train it from scratch without pretrained ImageNet weights?

glenn-jocher commented on May 22, 2024

@libzzluo so you trained COCO 2014 from scratch with this repo? This is wonderful news!! I haven't had time to get to 160 epochs, so I wasn't sure if the training code was mature or not (I was mostly unsure about the loss function and the optimizer).

Do you know which exact commit you used to achieve these results (and did you make any changes to get it to work)? Thanks!!

glenn-jocher commented on May 22, 2024

All, I trained to 60 epochs using the current setup. I used batch size 16 for the first 24 hours, then reverted to batch size 12 accidentally for the rest (hence the nonlinearity at epoch 10). A strange hiccup happened at epoch 40, then learning rate dropped from 1e-3 to 1e-4 at epoch 51 as part of the LR scheduler. This seemed to produce much accelerated improvements in recall during the last ten epochs. The test mAP at epoch 55 was 0.40 with conf_thresh = 0.10, so I feel if I continued training until perhaps epoch 100 we might get a very good mAP, especially seeing the Recall improving so well during epochs 51-60.

The strange thing is that I had to lower conf_thresh to 0.10 to get this good (0.40) mAP; otherwise I see 0.20 mAP at the default conf_thresh = 0.50. I am going to restart training with a constant batch size of 16, and hopefully the epoch 40 hiccup does not repeat.

[figure: training curves through epoch 60]

nirbenz commented on May 22, 2024

@glenn-jocher Interesting. Is this with the default COCO mAP calculation or with the one in the repository? Because they have quite a large difference.

nirbenz commented on May 22, 2024

@glenn-jocher By the way - is this with the BCE or CE for the loss functions (BCE being equivalent to original implementation)? Are there any other architectural changes?

glenn-jocher commented on May 22, 2024

@nirbenz This is with one BCE loss for all anchors. I'm surprised this works, since nearly all anchors are 0, with only a few 1's, but this seems to be how the darknet loss function is, since resuming training works well like this.

The mAPs are calculated from test.py in this repo. The test.py code is closer to the official code now, after I made some changes about 2 months ago. The official weights produce 0.57 mAP using test.py (at a 0.5 conf_thresh). I'm going to retrain on GCP for about 100 epochs and see where the mAP goes. Should take about a week or so.

okanlv commented on May 22, 2024

@glenn-jocher Why are you training from random weights? Darknet initially loads darknet53 weights, then the training starts (see #6).

glenn-jocher commented on May 22, 2024

@okanlv yes you are right, I should try training from darknet53 weights. I've downloaded the weights, but I don't have a simple way to load them into the randomly-initialized yolov3 model right now... have you done this before? I can try and develop a new function to handle this if not.

okanlv commented on May 22, 2024

@glenn-jocher Yes, I have implemented it. I can send a pull request if you wish.

glenn-jocher commented on May 22, 2024

UPDATE: I see the existing load_weights() function in models.py works fine for darknet53; I just need to implement a smart cutoff so it doesn't attempt to load layers past 74 when presented with a darknet53.conv.74 weights file. I'll implement this in a new commit. Ok, I've added a few lines to train.py to find and load the darknet53 weights if not resuming training. I'll start this on GCP and see how it goes:

yolov3/train.py

Lines 76 to 80 in 741626c

else:
    # Initialize model with darknet53 weights (optional)
    if not os.path.isfile('weights/darknet53.conv.74'):
        os.system('wget https://pjreddie.com/media/files/darknet53.conv.74 -P weights')
    load_weights(model, 'weights/darknet53.conv.74')

okanlv commented on May 22, 2024

@glenn-jocher Great, waiting for the training results

nirbenz commented on May 22, 2024

@glenn-jocher since your mAP code is still a bit different from the MS-COCO code (which among other things takes object sizes into account), I wonder if you (or anyone else) have tested this repository's results against the pycocotools test code.

nirbenz commented on May 22, 2024

Well then. I get these numbers using the official COCO SDK. Notes:

  • They are damn close to the original paper.
  • They do require quite different thresholds compared to the original paper, which is strange; these results use confidence=0.01, nms=0.45, iou=0.5.
  • 608x608 is very close, while 416x416 is a bit less so.

608

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.326
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.571
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.335
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.189
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.354
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.430
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.278
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.416
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.434
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.276
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.459
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.551

416

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.308
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.543
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.311
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.081
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.285
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.443
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.264
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.393
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.409
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.161
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.404
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.536

I will PR the code used to make this once I clean it up a bit.

deeppower commented on May 22, 2024

@glenn-jocher Hi! I tried darknet53.conv.74-initialized weights when training, but TP and FP gradually became 0 after hundreds of iterations.
[screenshot: training log with TP and FP at 0]

glenn-jocher commented on May 22, 2024

@deeppower yes, this is normal. The first 1000 batches are burn-in (see #2 (comment)), where the LR slowly ramps from 0 to its full initial value of 1e-3. After batch 1000 training proceeds normally, and you should see TPs start to appear in epoch 1 and increase steadily from there onward. The majority of epoch 0 shows almost no TPs (this is normal). Remember that full training may take 70 or more epochs (at least a week of COCO training on a 1080 Ti).

@nirbenz those are very close to the official values! I get 0.57 mAP on yolov3.pt using test.py with the default parameters confidence=0.50, nms=0.45, iou=0.5. BUT I noticed that, yes, reducing the confidence threshold massively helps the test.py mAP. I found that checkpoints trained using this repo usually show the highest mAP around conf_thresh = 0.10. It would be really cool to plot mAP vs confidence threshold.
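
A sweep like the sketch below would produce that plot (hypothetical; it assumes a test() helper that accepts a confidence threshold and returns mAP, which test.py does not currently expose):

import numpy as np

# test() is a hypothetical helper returning mAP for a given confidence threshold
for t in np.arange(0.05, 0.55, 0.05):
    print('conf_thres=%.2f  mAP=%.3f' % (t, test(conf_thres=t)))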

nirbenz commented on May 22, 2024

@glenn-jocher Yes, they are! Good news indeed. I have also noticed that different implementations require different tweaking of thresholds. The Keras YOLOv3 implementation also requires a different threshold to get equivalent results to the Darknet one.

deeppower commented on May 22, 2024

@glenn-jocher I have finished training the darknet53.conv.74-initialized version to epoch 68, but the mAP of latest.pt is only 0.2387 using test.py. Here are the loss curves:
[figure: loss curves through epoch 67]
How are your results?

sporterman commented on May 22, 2024

@deeppower Have you trained your own dataset with this code? I ran into some trouble while training my data; the precision and recall stay low and never change.

glenn-jocher commented on May 22, 2024

@deeppower thanks for the feedback. You need to vary conf_thresh to get the best mAP. Usually I test values between 0.01 and 0.50. In the current repo the best value seems to be around 0.1 - 0.2. For example, if you run this you should get a higher mAP, around 0.40 I think (but yes, still lower than what we want): python3 test.py -img_size 416 -weights_path weights/latest.pt -conf_thres 0.10

My results look like this, comparing random initialization vs darknet53.conv.74 initialization. Your results look much smoother than mine. My training is on GCP preemptible instances, which stop every 24 hours, or about every 10 epochs. I think this is causing the spikes in my losses, which is very frustrating because theoretically the training should resume with no breaks at all (the model and optimizer states are both saved and then restored, so I don't understand my spikes... possibly a pytorch issue).

mAP is 0.42 at conf_thresh = 0.20 at epoch 80. I will start some multiscale training here.
[figure: random vs darknet53.conv.74 initialization training curves]

deeppower commented on May 22, 2024

@sporterman Sorry, i haven't trained my own dataset.

deeppower commented on May 22, 2024

@glenn-jocher Thanks for your reply and great work. I have varied conf_thresh to 0.2, and mAP is 0.41 at epoch 68. There are still some problems we need to solve.

glenn-jocher commented on May 22, 2024

@deeppower yes, the training performance is still not as good as darknet's, unfortunately. I tried a few epochs of multi_scale training after epoch 80 and this did not seem to help. I've tried to align everything as closely as possible to darknet, so for example if you resume training from the official yolov3.pt weights, the P and R values are very steady (though still dropping slightly over time). This makes me think the loss function is correct, or at least very close to the original darknet loss function. Inference works well, so the problem cannot be there; it must be in the training-only code, which could be the optimizer, LR scheduler, loss function, target-building functions, IoU function, augmentation function...

glenn-jocher commented on May 22, 2024

@deeppower Yes, objectness loss is higher than before because I now multiply it by 10. I'm trying to balance the loss terms so they contribute equally to the gradient, or else the largest loss terms will get optimized at the expense of the smaller ones. It appears to be working, though my loss term multiples are rather arbitrary, unfortunately.

Ideally we want to take this a step further and equalize not just the loss terms but also the target distributions, to something like zero mean and unity variance, which helps regression networks at least (I'm not sure about object detection). Any experiments you can run on your own would help significantly; I'm just one man with one GPU here, so I can only try a finite set of things to improve the results.

glenn-jocher commented on May 22, 2024

@okanlv I have a question for you. Now that I've defaulted to starting training from darknet53.conv.74, would it make sense to freeze those layers for some time before allowing them to change?

I was thinking I could freeze them for perhaps the first epoch, which would be 7328 batches, or at least half an epoch (the first 1000 batches are burn-in). I feel it would make sense to do this since the randomly initialized layers might converge much faster without the darknet53.conv.74 layers changing underneath them. A sketch of the idea is below.
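
A sketch of what that freeze could look like (my illustration; it assumes the darknet53 backbone occupies the first 75 entries of the model's module_list, per the cutoff discussed above):

def set_backbone_frozen(model, frozen, cutoff=75):
    # Freeze/unfreeze the first `cutoff` modules (the darknet53 backbone).
    for i, module in enumerate(model.module_list):
        if i < cutoff:
            for p in module.parameters():
                p.requires_grad = not frozen

set_backbone_frozen(model, frozen=True)   # epoch 0: train only the head
# ... after epoch 0 ...
set_backbone_frozen(model, frozen=False)  # epoch 1+: train everything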

okanlv commented on May 22, 2024

@glenn-jocher In the darknet repo, all layers are trained together after yolov3 is initialized with the darknet53.conv.74 weights. In this paper, the authors showed that updating the parameters of all the layers increases performance compared to updating the parameters of only the top layers (related to the fragile co-adaptation of layers mentioned in the paper). That being said, your method might also work, because there are a few differences between your approach and the experiments in the paper. If you train yolov3 with your approach, could you share the loss graphs for both your approach and the current method? It could be beneficial for further experiments.
