Train RetinaNet with Focal Loss in PyTorch.
Reference:
[1] Focal Loss for Dense Object Detection
RetinaNet in PyTorch
Hi @kuangliu, does this code follow the paper exactly? Also, are you able to document it sufficiently so that we can get started on it using VOC?
Could you please add a section on how to train on a custom dataset? The results you got on a standard dataset such as PASCAL VOC or COCO would also be a great help.
It would be nice to be able to train using other standard networks as well, since FPN requires too much memory. Right now, changing the base network in retinanet.py (e.g., to VGG) does not work out of the box.
Has anybody tried and got it working?
Most of the images in voc2012_val.txt contain only people; objects from the other 19 classes are not included.
Hi. Thank you for sharing your implementation.
Is it possible to get the same (or similar) mAP as in the paper?
I would really appreciate it if you could tell me the final performance.
Hello, I want to run test.py. I got the model from the model zoo, but how do I load model_final.pkl?
My GPU has 2 GB of memory, so I run this program with a batch size of 1.
However, I hit the error below, which forces training to stop.
What should I do to get results similar to the paper with my configuration?
Thank you.
Error:
Traceback (most recent call last):
  File "./t.py", line 1098, in <module>
    train(epoch)
  File "./t.py", line 1059, in train
    loss = criterion(loc_preds, loc_targets, cls_preds, cls_targets)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "./t.py", line 984, in forward
    print('loc_loss: %.3f | cls_loss: %.3f' % (loc_loss.data[0]/num_pos, cls_loss.data[0]/num_pos), end=' | ')
ZeroDivisionError: float division by zero
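The traceback above ends in a division by num_pos, which is zero whenever a batch contains no positive anchors (more likely with a batch size of 1). A minimal sketch of one common guard, assuming the summed losses are plain floats; normalize_losses is a hypothetical helper for illustration, not a function from the repository:

```python
def normalize_losses(loc_loss_sum, cls_loss_sum, num_pos):
    """Divide summed losses by the positive-anchor count, guarding against zero."""
    # Hypothetical helper: clamp the normalizer to at least 1 so that a batch
    # without positive anchors does not raise ZeroDivisionError.
    normalizer = max(1, num_pos)
    return loc_loss_sum / normalizer, cls_loss_sum / normalizer


print(normalize_losses(3.0, 12.0, 0))  # no positives: sums are returned unchanged
print(normalize_losses(3.0, 12.0, 4))  # four positives: both sums divided by 4
```

The same clamp can be applied to the values passed to the print statement, so that logging does not crash either.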
Line 30 in 2d7c663
bias should be set to 0 and the weights to a normal distribution weight fill with σ = 0.01.
π = 0.01
−log((1 − π)/π) = −4.59511985013459
forward method. This allows for the weights to be loaded, but for the network to have a completely different output. You structured your model differently, so I'm not sure if this would work for your structure.
ResNet and training on the COCO dataset.
Anyway, just let me know what you think and I can open a pull request for these things if you would like. I have written most of these fixes for work and would love to merge them into your repo.
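The prior-probability value quoted in this issue (π = 0.01) can be checked with a short sketch. It only verifies the arithmetic: a bias of −log((1 − π)/π) makes a sigmoid output equal to π. The function names are hypothetical, and where the bias is applied in the network is not shown here:

```python
import math


def prior_bias(pi=0.01):
    """Return the bias b for which sigmoid(b) equals the prior probability pi."""
    return -math.log((1.0 - pi) / pi)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


b = prior_bias(0.01)
print(b)           # approximately -4.595
print(sigmoid(b))  # approximately 0.01
```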
This is not an issue but a question.
I think the terms (1-p)^gamma and p^gamma in focal loss are for weighting only. They should not be backpropagated through during gradient descent. Am I correct?
If so, do you need to detach() your variables when computing the weight terms in your focal loss function?
I cannot find the val split images in the VOC2012 dataset downloaded from http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar.
Where can I download the validation data?
Has anyone tested this repo?
Hi, I have a question about using this project to train on the WIDER Face detection dataset. I created labels in the format your script expects (name.jpg xmin ymin xmax ymax). While debugging to check that my setup is right, I printed the shapes of the data coming from the dataloader:
(1L, 3L, 600L, 600L)
(1L, 67995L, 4L)
(1L, 67995L)
Why does the second dimension become 67995? The Dataset __getitem__ method produces data with these shapes:
(1L, 3L, 600L, 600L)
(1L, 4L)
(1L)
I haven't used PyTorch before. Is something wrong with PyTorch, or with my label file?
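The value 67995 is consistent with the total number of anchors for a 600×600 input, which suggests the dataloader returns targets encoded per anchor rather than per ground-truth box. A rough check, assuming 9 anchors per location on five pyramid levels with strides 8 to 128 and feature-map sizes rounded up; these settings follow the RetinaNet paper and are assumptions about the encoder, and count_anchors is a hypothetical helper:

```python
import math


def count_anchors(input_size=600, strides=(8, 16, 32, 64, 128), anchors_per_cell=9):
    """Count anchors over all pyramid levels for a square input."""
    total = 0
    for stride in strides:
        fm_size = math.ceil(input_size / stride)  # feature-map side length
        total += fm_size * fm_size * anchors_per_cell
    return total


print(count_anchors())  # 67995
```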
boxes[:,1::2].clamp_(min=0, max=h-1): the boxes' ymax is not clamped.
Does your code not use an RPN? It looks as if the final classification and regression are performed directly on every pixel of each final feature map, with the intermediate RPN removed.
Why are there two versions of the focal loss method in class FocalLoss(nn.Module)? URL: https://github.com/kuangliu/pytorch-retinanet/blob/2199fd9711fd787ae409800a499db73e6d466fd7/loss.py
Hi, can you also provide evaluation code to compute mAP after training? Thanks.
When I train RetinaNet, cls_loss easily becomes NaN. Does anyone know the reason?
loc_loss: 0.084 | cls_loss: nan | train_loss: nan | avg_loss: nan
python train.py
==> Preparing data..
Epoch: 0
Traceback (most recent call last):
  File "train.py", line 114, in <module>
    train(epoch)
  File "train.py", line 75, in train
    loss = criterion(loc_preds, loc_targets, cls_preds, cls_targets)
  File "/home/hs/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/hs/hs/pytorch/pytorch-retinanet-580/loss.py", line 99, in forward
    print('loc_loss: %.3f | cls_loss: %.3f' % (loc_loss.item()/num_pos, cls_loss.item()/num_pos), end=' | ')
  File "/home/hs/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 320, in __rdiv__
    return self.reciprocal() * other
RuntimeError: reciprocal is not implemented for type torch.cuda.LongTensor
Would it be better to let the batch norm parameters adapt to your current data?
Lines
max_size, _ = torch.IntTensor([im.size() for im in imgs]).max(0)
max_h, max_w = max_size[1], max_size[2]
Where can I download the pretrained FPN101 torch model?
This PyTorch version is quite helpful for research! I want to do transfer learning on other datasets. Is anyone willing to share a model pretrained on VOC or COCO? Thanks very much!
loss = -w*pt.log() / 2
Because of this line, the loss function is not numerically stable.
For instance, pt.log() becomes -inf as pt goes to zero.
pt is a sigmoid output, so it can reach zero in floating point.
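One common way around the -inf is to compute log(pt) directly from the logit instead of taking the log of a sigmoid output. The sketch below is a scalar, pure-Python illustration of the sigmoid focal loss from the paper, not the repository's implementation; the function names and the default alpha and gamma values are assumptions for illustration:

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x)) for any finite x."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def binary_focal_loss(logit, target, alpha=0.25, gamma=2.0):
    """Focal loss for a single logit and a 0/1 target."""
    # pt = sigmoid(logit) for target 1 and 1 - sigmoid(logit) = sigmoid(-logit) for target 0.
    signed_logit = logit if target == 1 else -logit
    log_pt = log_sigmoid(signed_logit)
    pt = math.exp(log_pt)
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - pt) ** gamma * log_pt


# A badly wrong prediction gives a large but finite loss instead of inf.
print(binary_focal_loss(-1000.0, 1))
```

On tensors the same idea is usually expressed with a log-sigmoid function or by clamping pt away from zero before taking the log.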
print('Loading model..')
net = RetinaNet()
net.load_state_dict(torch.load('./checkpoint/params.pth')['net'])
net.eval()
Hello, when I run retinanet.py, it shows this error:
Traceback (most recent call last):
  File "retinanet.py", line 58, in <module>
    test()
  File "retinanet.py", line 56, in test
    cls_preds.backward(cls_grads)
  File "/home/ztgong/local/anaconda2/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ztgong/local/anaconda2/lib/python2.7/site-packages/torch/autograd/__init__.py", line 89, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
How can I fix it? Can you give some advice? Thank you.
Perhaps these four lines should not be commented out. https://github.com/kuangliu/pytorch-retinanet/blob/master/utils.py#L201-L204
flake8 testing of https://github.com/kuangliu/pytorch-retinanet on Python 2.7.13
$ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics
./utils.py:211:19: F821 undefined name 'TOTAL_BAR_LENGTH'
    cur_len = int(TOTAL_BAR_LENGTH*current/total)
                  ^
./utils.py:212:20: F821 undefined name 'TOTAL_BAR_LENGTH'
    rest_len = int(TOTAL_BAR_LENGTH - cur_len) - 1
                   ^
./utils.py:223:28: F821 undefined name 'last_time'
    step_time = cur_time - last_time
                           ^
./utils.py:235:20: F821 undefined name 'term_width'
    for i in range(term_width-int(TOTAL_BAR_LENGTH)-len(msg)-3):
                   ^
./utils.py:235:35: F821 undefined name 'TOTAL_BAR_LENGTH'
    for i in range(term_width-int(TOTAL_BAR_LENGTH)-len(msg)-3):
                                  ^
./utils.py:239:20: F821 undefined name 'term_width'
    for i in range(term_width-int(TOTAL_BAR_LENGTH/2)):
                   ^
./utils.py:239:35: F821 undefined name 'TOTAL_BAR_LENGTH'
    for i in range(term_width-int(TOTAL_BAR_LENGTH/2)):
                                  ^
7 F821 undefined name 'TOTAL_BAR_LENGTH'
can you tell me ?
Traceback (most recent call last):
  File "train.py", line 15, in <module>
    from loss import FocalLoss
  File "H:\datasets\tianchi_lvcai\tianchi_lvcai_fusai\pytorch-retinanet-master\loss.py", line 7, in <module>
    from utils import one_hot_embedding
  File "H:\datasets\tianchi_lvcai\tianchi_lvcai_fusai\pytorch-retinanet-master\utils.py", line 242, in <module>
    _, term_width = os.popen('stty size', 'r').read().split()
ValueError: not enough values to unpack (expected 2, got 0)
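The ValueError shows that the 'stty size' call returned nothing to unpack, which is what happens when no stty command is available, as with the Windows paths above. A minimal sketch of a portable alternative using the standard library (Python 3.3+); the 80×24 fallback is an assumption for illustration:

```python
import shutil

# get_terminal_size works on Windows and Unix, and returns the fallback
# when the size cannot be determined (for example, when output is piped).
term_width = shutil.get_terminal_size(fallback=(80, 24)).columns
print(term_width)
```

Replacing the os.popen line in utils.py with a call like this avoids the dependency on stty.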
@kuangliu, I am using pytorch 0.12, and I find that in the encode function of the DataEncoder class, "boxes=boxes[max_idx]" does not work. Is this operation meant to expand the boxes tensor to the number of anchors? Is it failing for me because of the different version?
Thanks
When I use this model to train on COCO 2017 with num_classes set to 80, I get this error:
RuntimeError: index out of range at /opt/conda/conda-bld/pytorch_1503966894950/work/torch/lib/TH/generic/THTensorMath.c:277
It occurs at "pytorch-retinanet/utils.py", line 230, in one_hot_embedding: return y[labels].
What is happening, and how can I solve this problem?
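One possible cause, assuming the labels are raw COCO category_id values: COCO has 80 categories, but their ids are not contiguous and run up to 90, so indexing an 80-row (or 81-row) one-hot table with them goes out of range. A minimal sketch of remapping ids to contiguous indices; build_label_map is a hypothetical helper and the id list is a small illustrative subset, not the full COCO list:

```python
def build_label_map(category_ids):
    """Map arbitrary category ids to contiguous indices 0..N-1."""
    return {cat_id: idx for idx, cat_id in enumerate(sorted(set(category_ids)))}


# Illustrative ids with gaps, similar to raw COCO category ids.
sample_ids = [1, 2, 3, 11, 13, 90]
label_map = build_label_map(sample_ids)
print(label_map[90])  # 5: the largest id maps to the last contiguous index
```

If the encoder reserves 0 for background, the mapped indices would need to be shifted by one.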
I would like to close my fork.
Where can I get the initial net.pth? From the model zoo?
I noticed that the encoder generated anchor proposals with dimensions larger than the input. i.e. some have width > 600px. Is this intended?
Torch version: 0.4.1
When I train the project with Python 2.7, I get the following problem:
python train.py
==> Preparing data..
Epoch: 0
Traceback (most recent call last):
  File "train.py", line 114, in <module>
    train(epoch)
  File "train.py", line 68, in train
    for batch_idx, (inputs, loc_targets, cls_targets) in enumerate(trainloader):
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 336, in __next__
    return self._process_next_batch(batch)
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 357, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 106, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/wq/retinaNet_pytorch/pytorch-retinanet-master_old/datagen.py", line 120, in collate_fn
    loc_target, cls_target = self.encoder.encode(boxes[i].long(), labels[i].long(), input_size=(w,h))
  File "/home/wq/retinaNet_pytorch/pytorch-retinanet-master_old/encoder.py", line 78, in encode
    anchor_boxes = self._get_anchor_boxes(input_size)
  File "/home/wq/retinaNet_pytorch/pytorch-retinanet-master_old/encoder.py", line 52, in _get_anchor_boxes
    xy = (xy*grid_size).view(fm_h,fm_w,1,2).expand(fm_h,fm_w,9,2)
RuntimeError: Expected object of type torch.LongTensor but found type torch.FloatTensor for argument #2 'other'
I trained this project on VOC2012, which I downloaded myself. I tried to correct this problem, but my changes caused other, similar problems.
The __init__ function in encoder.py
first sets the anchor area for each feature map (P3 to P7):
self.anchor_areas = [32 * 32., 64 * 64., 128 * 128., 256 * 256., 512 * 512.]
and then combines with the anchor location:
wh = self.anchor_wh[i].view(1, 1, 9, 2).expand(fm_h, fm_w, 9, 2)
box = torch.cat([xy, wh], 3)
I do think the anchor areas should be adjusted by the actual object size, especially when the input image is small. Given that we encode the boxes in advance, we should take care of the setting of anchor areas.
Is this understanding right?
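To see how the fixed areas translate into concrete anchor sizes, here is a sketch that expands each area into 9 (width, height) pairs. The three aspect ratios and three scales follow the RetinaNet paper and are assumptions about what the encoder uses; anchor_sizes is a hypothetical helper:

```python
import math


def anchor_sizes(area, aspect_ratios=(0.5, 1.0, 2.0),
                 scale_ratios=(1.0, 2 ** (1 / 3), 2 ** (2 / 3))):
    """Return the (w, h) pairs for one pyramid level."""
    sizes = []
    for ar in aspect_ratios:          # ar = w / h
        h = math.sqrt(area / ar)
        w = ar * h                    # w * h equals the base area
        for s in scale_ratios:
            sizes.append((w * s, h * s))
    return sizes


for area in [32 * 32., 64 * 64., 128 * 128., 256 * 256., 512 * 512.]:
    widest = max(w for w, _ in anchor_sizes(area))
    print(int(math.sqrt(area)), round(widest, 1))
```

Under these assumptions the largest anchors are wider than 1000 px, far larger than a 600 px input, so for small images or small objects it may be worth reducing the areas.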
Do I need to install a specific PyTorch version or apply some trick to get rid of this?
==> Preparing data..
Epoch: 0
loc_loss: 0.116 | cls_loss: 3791.763 | train_loss: 3791.878 | avg_loss: 3791.878
loc_loss: 0.088 | cls_loss: 1283.638 | train_loss: 1283.725 | avg_loss: 2537.802
loc_loss: 0.093 | cls_loss: 8380.014 | train_loss: 8380.107 | avg_loss: 4485.237
loc_loss: 0.095 | cls_loss: 2.312 | train_loss: 2.407 | avg_loss: 3364.530
Traceback (most recent call last):
  File "train.py", line 114, in <module>
    train(epoch)
  File "train.py", line 75, in train
    loss = criterion(loc_preds, loc_targets, cls_preds, cls_targets)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xiangyong/Workbench/pytorch-retinanet-kuangliu/loss.py", line 92, in forward
    cls_loss = self.focal_loss_alt(masked_cls_preds, cls_targets[pos_neg])
  File "/home/xiangyong/Workbench/pytorch-retinanet-kuangliu/loss.py", line 60, in focal_loss_alt
    return loss.sum()
RuntimeError: value cannot be converted to type double without overflow: inf
Hey guys, I have been super busy these two weeks. I finally have some time to work on this.
For now, let's fix the issues one by one.
@kuangliu @Mendel1 In the encoder file, the output of "get_anchor_boxes" is in "xcenter, ycenter, xwidth, ywidth" format, so it seems it does not need to be changed to xxwh (I guess you mean xywh) using the change_box_order function?
anchor_boxes is ordered as xywh, and boxes is changed from xyxy to xywh with change_box_order:
boxes = change_box_order(boxes, 'xyxy2xywh')
Now they are both xywh. Any problems?
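For reference, the two box orders discussed here can be illustrated on a single box. This is a pure-Python stand-in for the idea behind change_box_order, not the repository's function; the helper names are hypothetical, and the repository may use a different width convention (for example, adding 1 pixel):

```python
def xyxy_to_xywh(box):
    """(xmin, ymin, xmax, ymax) -> (xcenter, ycenter, width, height)."""
    xmin, ymin, xmax, ymax = box
    return ((xmin + xmax) / 2.0, (ymin + ymax) / 2.0, xmax - xmin, ymax - ymin)


def xywh_to_xyxy(box):
    """(xcenter, ycenter, width, height) -> (xmin, ymin, xmax, ymax)."""
    xc, yc, w, h = box
    return (xc - w / 2.0, yc - h / 2.0, xc + w / 2.0, yc + h / 2.0)


print(xyxy_to_xywh((10, 20, 50, 80)))                 # (30.0, 50.0, 40, 60)
print(xywh_to_xyxy(xyxy_to_xywh((10, 20, 50, 80))))   # (10.0, 20.0, 50.0, 80.0)
```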
@kuangliu I am using a Titan X graphics card with 12 GB of memory, but even with batch_size=2 I still run out of memory. May I ask what hardware and software configuration you are using? Thanks.
In F.smooth_l1_loss, if "size_average" is True, the loss is divided by (samples_num * location_vector_length). You set "size_average" to False, but then divide only by samples_num. Why is that?
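The two normalizations in this question differ only by a constant factor of 4 (the length of the location vector), which acts like a weight on the localization loss. A scalar sketch, assuming the standard smooth L1 definition with a threshold of 1; loc_loss here is a hypothetical helper, not the repository's code:

```python
def smooth_l1(diff):
    """Smooth L1 for a single difference, threshold 1."""
    a = abs(diff)
    return 0.5 * a * a if a < 1.0 else a - 0.5


def loc_loss(preds, targets):
    """preds and targets are lists of 4-tuples, one per positive anchor."""
    num_pos = len(preds)
    total = sum(smooth_l1(p - t)
                for pred_box, target_box in zip(preds, targets)
                for p, t in zip(pred_box, target_box))
    per_anchor = total / max(1, num_pos)        # summed loss divided by the number of positives
    per_element = total / max(1, num_pos * 4)   # what element-wise averaging would give
    return per_anchor, per_element


print(loc_loss([(0.0, 0.0, 0.0, 0.0)], [(0.5, 2.0, 0.0, 0.0)]))  # (1.625, 0.40625)
```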
When I train with the most recent version (commit: fda946, using focal_loss_alt()), the loss is still NaN.
This is the same result as with the previous focal_loss().
Are there any additional settings?
Is there any solution?
Thank you.
@kuangliu Could you please tell me why you use log_softmax to compute the focal loss instead of the sigmoid layer mentioned in the paper? Or have I misunderstood something?
Hi,
I have implemented your code and it works properly, but I have the following concerns.
My pseudocode works like this:
cls_targets = [batch_size, anchor_boxes, classes] # classes is 21 (voc_labels+background) [16, 67995, 21]
cls_preds = [batch_size, anchor_boxes] # anchor_boxes number ranges from -1 to 20 [67995, 21]
Now I remove all the anchor boxes with -1 (ignore_boxes)
cls_targets = [batch_size * valid_anchor_boxes, classes] # [54933, 21]
cls_preds = [batch_size * valid_anchor_boxes, classes] # [54933, 21] This is one hot encoding vector
I then followed your code and implemented the focal loss as is, but my loss values come out very small. With random values the loss is about 0.12, and it quickly drops to 0.0012 and below.
Is there something I am missing?
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable


class FocalLoss_tensorflow(nn.Module):
    def __init__(self, num_classes=20,
                 focusing_param=2.0,
                 balance_param=0.25):
        super(FocalLoss_tensorflow, self).__init__()
        self.num_classes = num_classes
        self.focusing_param = focusing_param
        self.balance_param = balance_param

    def focal_loss(self, x, y):
        """Sigmoid focal loss; column 0 of x and of the one-hot target is dropped."""
        x = x[:, 1:]
        sigmoid_p = torch.sigmoid(x)
        anchors, classes = x.shape

        # One-hot encode the labels, then drop the first column.
        t = torch.FloatTensor(anchors, classes+1)
        t.zero_()
        t.scatter_(1, y.data.cpu().view(-1, 1), 1)
        t = Variable(t[:, 1:]).cuda()

        zeros = Variable(torch.zeros(sigmoid_p.size())).cuda()
        # (t - p) where the target is at least p, else 0; and p where the target is 0.
        pos_p_sub = ((t >= sigmoid_p).float() * (t-sigmoid_p)) + ((t < sigmoid_p).float() * zeros)
        neg_p_sub = ((t >= zeros).float() * zeros) + ((t <= zeros).float() * sigmoid_p)

        per_entry_cross_ent = (-1) * self.balance_param * (pos_p_sub ** self.focusing_param) * torch.log(torch.clamp(sigmoid_p, 1e-8, 1.0)) -(1-self.balance_param) * (neg_p_sub ** self.focusing_param) * torch.log(torch.clamp(1.0-sigmoid_p, 1e-8, 1.0))
        # Averages over every anchor/class entry.
        return per_entry_cross_ent.mean()

    def forward(self, loc_preds, loc_targets, cls_preds, cls_targets):
        batch_size, num_boxes = cls_targets.size()
        pos = cls_targets > 0                      # anchors matched to an object
        num_pos = pos.data.long().sum()

        # Localization loss over positive anchors only.
        mask = pos.unsqueeze(2).expand_as(loc_preds)
        masked_loc_preds = loc_preds[mask].view(-1,4)
        masked_loc_targets = loc_targets[mask].view(-1,4)
        loc_loss = F.smooth_l1_loss(masked_loc_preds, masked_loc_targets, size_average=False)
        loc_loss = loc_loss/num_pos

        # Classification loss over all anchors except ignored ones (label -1).
        pos_neg = cls_targets > -1
        mask = pos_neg.unsqueeze(2).expand_as(cls_preds)
        masked_cls_preds = cls_preds[mask].view(-1, self.num_classes)
        cls_loss = self.focal_loss(masked_cls_preds, cls_targets[pos_neg])
        return loc_loss, cls_loss
Question 1:
I am still not sure whether I should use 0 as my background class, or how normalization is done when focal loss is applied.
pytorch-retinanet/transform.py
Line 122 in b262983
RT
Hi,
The training loss decreases with each epoch, while the validation loss increases and then stagnates after some time.
Can you put up some results on the VOC datasets so that we can cross-check?
Thanks
@kuangliu Hi,
For focal loss, the classification branch uses the sigmoid function.
Why is the background class considered in the classification branch? For COCO, for example, num_classes=80 instead of 81.
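With one independent sigmoid per class, a background anchor can be represented by an all-zero target row, so no separate background output is needed (80 outputs for COCO rather than 81). A minimal sketch, assuming labels where 0 means background and 1..num_classes are object classes; one_hot_targets is a hypothetical helper:

```python
def one_hot_targets(label, num_classes):
    """Build the sigmoid target row for one anchor; 0 = background."""
    row = [0.0] * num_classes
    if label > 0:
        row[label - 1] = 1.0   # object class k occupies column k - 1
    return row


print(one_hot_targets(0, 3))  # [0.0, 0.0, 0.0]: background is the all-zero row
print(one_hot_targets(2, 3))  # [0.0, 1.0, 0.0]
```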