median-research-group / LibMTL

A PyTorch Library for Multi-Task Learning

License: MIT License

Python 100.00%
multitask-learning multi-task-learning pytorch python multi-domain-learning deep-learning mtl mmoe ple multiobjective-optimization

LibMTL's Issues

Support for missing labels

Thank you for this excellent library. I'm wondering if you would consider supporting single-input problems where there are incomplete labels for each task.

A simple example:

  • Input: Image from the camera
  • Output:
    • Task 1: Identify street signs (binary)
    • Task 2: Identify pedestrians (binary)
    • Task 3: Identify crossing animals (binary)
    • ...
    • Task 30: Identify street light (binary)
The problem is that not all training examples have labels for all tasks. For instance:
label for img1: [0, 1, 1, 0, 1, ..., 0, 0, 1]
label for img2: [1, ?, ?, 1, 1, ..., 1, ?, 0]
label for img3: [1, 0, ?, 1, 0, ..., 0, 1, ?]

(? is a missing label: the label doesn't specify whether or not img2 has a pedestrian.)

One could treat this as a multi-input problem: duplicate the dataset and exclude the missing labels for each task, creating 30 datasets. However, it is wasteful to train 30 separate task-specific forward networks without parameter sharing, and even propagating a single image forward 30 times is an inefficient use of computational resources. A masked per-task loss could avoid this; see the sketch below.
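A minimal sketch of that masked-loss idea (hypothetical code, not part of the current LibMTL API; the function and argument names are made up):

    import torch
    import torch.nn.functional as F

    def masked_multitask_loss(logits, labels, mask):
        # logits, labels, mask: [batch, num_tasks]; mask is 1 where a label exists
        per_elem = F.binary_cross_entropy_with_logits(
            logits, labels.float(), reduction='none')
        per_elem = per_elem * mask           # zero out positions with missing labels
        # average each task's loss over its labeled examples only
        return per_elem.sum(dim=0) / mask.sum(dim=0).clamp(min=1)

This way a single shared forward pass serves all 30 heads, and an example simply contributes nothing to the losses of the tasks it is unlabeled for.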

Please let me know if I have the wrong assumptions about how LibMTL works, or if my problem can easily be solved with the current API. Thank you!

Saving and loading models

Hello LibMTL team,
I went through your code, and I am not sure which part could be extended to support saving and loading models to disk.
The model inside the Trainer is not picklable, and I was wondering if you could point me to the components that would need to be saved and loaded, for instance to resume training or postpone testing.
Best,
Salah
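For readers with the same question, a sketch of one way to checkpoint, assuming the Trainer exposes model, optimizer, and scheduler attributes (the attribute names are a guess, not confirmed API):

    import torch

    # state_dicts are picklable even when the model class itself is not
    torch.save({'model': trainer.model.state_dict(),
                'optimizer': trainer.optimizer.state_dict(),
                'scheduler': trainer.scheduler.state_dict()}, 'ckpt.pt')

    # to resume, rebuild the Trainer the same way, then restore the states
    ckpt = torch.load('ckpt.pt')
    trainer.model.load_state_dict(ckpt['model'])
    trainer.optimizer.load_state_dict(ckpt['optimizer'])
    trainer.scheduler.load_state_dict(ckpt['scheduler'])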

trainer can't work

When I try to run the trainer, .next() fails:

-> "AttributeError: 'dict_keyiterator' object has no attribute 'next'"

I use:
python=3.7
torch=11.3

Has this method been removed in Python 3? Why not use next(iter)?

Or what should I do to fix it?
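For reference, iterator.next() was Python 2 syntax; Python 3 removed it in favor of the built-in next(), so the usual fix looks like this (a generic sketch, not the exact LibMTL line):

    it = iter(dataloader)
    # batch = it.next()   # Python 2 style, raises AttributeError on Python 3
    batch = next(it)       # Python 3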


How to determine the shared part of the network

My network uses an FPN. Some heads use the output of the FPN's first level, while other heads use the outputs of other levels.
How do I determine which part of the FPN counts as the shared part of the network?

uw initialization

Hi, I found that the value -0.5 is used when initializing the parameter at line 19 of uw.py.
My question is why this value is not 0, since the variable loss_scale is equivalent to log σ in the original paper.

self.loss_scale = nn.Parameter(torch.tensor([-0.5]*self.task_num, device=self.device))

Issue with val Epoch and `self.base_result`

Hi,
Thanks for open sourcing this wonderful repo. I have a question regarding the following line

self.best_result['epoch'] = epoch
Shouldn't base_result also be updated here, i.e. self.base_result = new_result? Otherwise the improvement will always be compared against the original base_result, and we would not get the best checkpoint. I may be missing something, but the results of my experiments don't look correct to me.
thanks

office-31 demo doesn't work

I ran the office-31 demo and found some errors in this example:
The version on pip is not the same as the version published on GitHub.
The operating environment is as follows:

  • python = 3.8
  • pytorch = 1.12

In train.py, .next() fails (the same AttributeError as in the issue above).

After I fixed that syntax, a new problem appeared at this line:
data = data.to(self.device, non_blocking=True)

I want to know if this is a version reason or something else

Performance issue for NYUv2

Thank you for your excellent contribution!

I have a question about the performance of MTAN with PCGrad.

In my experiment, the result on NYUv2 is very different from the official PCGrad performance (although I did not use UW in this experiment).

[screenshot: NYUv2 results table; apologies that the image is very small]

Also, when I experimented with resnet50-HPS without any weighting method, the result exceeded the official MTAN performance.
[screenshot: resnet50-HPS results table]

I tried your official training command line in all cases and used the dataset in your Dropbox.

Is this result expected?

Thank you in advance!

Multi-input MTL

Most of the research papers for the implemented architectures and weighting methods seem to deal with single-input-multi-output tasks such as simultaneous object detection and segmentation from a single image. However, LibMTL also supports multi-input tasks, where each task has its own data (e.g. MNIST where each digit has its own task-specific layers). Is there a set of research papers that discuss this approach? I'm curious to learn how LibMTL evaluates the loss in these situations. I can dig into the code to get a deeper understanding, but in the meantime, if there are papers that discuss this, it would really help to have some links (and maybe update the docs).

Thanks!

Use of rep_grad with different weighting methods

Hello, why do some methods restrict rep_grad to be True or False?
For example, GradDrop only allows True, while PCGrad, GradVac, CAGrad, and others only allow False.

Question about Cross stitch implementation

Hi, I would like to ask a question about the Cross-Stitch implementation.
In your code you create cross_unit as torch.ones(4, self.task_num). So for each of the 4 ResNet layers you have a tensor of task_num values; for example, with 2 tasks it would be a tensor of shape (4, 2), and each layer has a 1x2 cross unit. But in equation (1) of the paper, for two tasks A and B there is a 2x2 matrix:

    x̃_A = α_AA·x_A + α_AB·x_B
    x̃_B = α_BA·x_A + α_BB·x_B

Is there a reason you have a 1x2 cross unit, or did I misunderstand something in the paper?

self.cross_unit = nn.Parameter(torch.ones(4, self.task_num))

def forward(self, inputs):
    s_rep = {task: self.resnet_conv[task](inputs) for task in self.task_name}
    ss_rep = {i: [0]*self.task_num for i in range(4)}
    for i in range(4):
        for tn, task in enumerate(self.task_name):
            if i == 0:
                ss_rep[i][tn] = self.resnet_layer[str(i)][tn](s_rep[task])
            else:
                cross_rep = sum([self.cross_unit[i-1][j]*ss_rep[i-1][j] for j in range(self.task_num)])
                ss_rep[i][tn] = self.resnet_layer[str(i)][tn](cross_rep)
    return ss_rep[3]

Because I was expecting to see something like this in the implementation:

self.cross_unit = nn.Parameter(torch.ones(4, self.task_num, self.task_num)) # matrix for each layer

def forward(self, inputs):
    s_rep = {task: self.resnet_conv[task](inputs) for task in self.task_name}
    ss_rep = {i: [0]*self.task_num for i in range(4)}
    for i in range(4):
        for tn, task in enumerate(self.task_name):
            if i == 0:
                ss_rep[i][tn] = self.resnet_layer[str(i)][tn](s_rep[task])
            else:
                cross_rep = sum([self.cross_unit[i-1][tn][j]*ss_rep[i-1][j] for j in range(self.task_num)]) # access matrix row of each task
                ss_rep[i][tn] = self.resnet_layer[str(i)][tn](cross_rep)
    return ss_rep[3]

About the MGDA implementation: some details I want to confirm

Hi, thanks for your wonderful project. There are some questions I want to confirm as I apply the MGDA weighting method; could you please answer them? Thanks!

  1. What is self.rep_tasks?
  2. What is rep_grad?

For the above two questions, I'll attempt my own answers: the first is the representation generated by the representation layer (the shared parameters), and the second indicates whether the gradients of the representations are used.

Given that, my third question is: what is the purpose of the variable rep_grad in MGDA?

Is it used to implement MGDA-UB? I noticed that the gradients of self.rep_tasks are saved in the _compute_grad() function in abstract_weighting.py, so I made this assumption.

I'm a little confused about these technical details; I hope you can help. Thanks again!

question about uw

In the original UW paper, the objective function for one regression task and one classification task is

    L(W, σ1, σ2) ≈ 1/(2σ1²)·L1(W) + 1/σ2²·L2(W) + log σ1 + log σ2

and according to the paper, the denominator of the second (classification) term, σ2², differs from that of the first (regression) term, 2σ1².
But in your code, loss = (losses/(2*self.loss_scale.exp())+self.loss_scale/2).sum(), there is no distinction between the denominators of these two terms.
Is that correct?

Support for Visualization

Thanks for your excellent work!

Could LibMTL provide visualization support? It would make it easy to track the training progress of the model using tools like tensorboardX or Visdom.

Thanks
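As a sketch of what such support might look like, using torch.utils.tensorboard (which ships with PyTorch); the train_one_epoch function and its per-task loss dict are hypothetical stand-ins for the training loop:

    from torch.utils.tensorboard import SummaryWriter

    writer = SummaryWriter(log_dir='runs/libmtl')
    for epoch in range(100):
        train_losses = train_one_epoch()          # hypothetical: {task_name: loss}
        for task, loss in train_losses.items():
            writer.add_scalar(f'loss/{task}', loss, epoch)
    writer.close()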

Time-series MTL

Hi,

First of all, this is a fantastic library! Amazing work.

My question: can LibMTL be used in time-series applications? I.e., do we only need to provide encoder/decoder architectures such as LSTMs? Would the loss-weighting methods need to be extended in any way?
I'm fairly new to MTL, so pardon the naive question.

Thanks,
Madhu
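In case it helps the discussion, a minimal LSTM encoder of the kind the question describes (a sketch; the shapes and names are assumptions, not LibMTL requirements):

    import torch.nn as nn

    class LSTMEncoder(nn.Module):
        # encodes a sequence [batch, seq_len, n_features] into one vector
        def __init__(self, n_features, hidden_size):
            super().__init__()
            self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)

        def forward(self, x):
            _, (h_n, _) = self.lstm(x)   # h_n: [1, batch, hidden_size]
            return h_n.squeeze(0)        # shared representation per sequence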

GradVac's gradient update

Hello! The GradVac paper notes that the gradient similarity between different layers of the network converges to different values, so it sets different target values for different tasks and for different layers.
The original text reads:
To incorporate these three factors, we exploit an exponential moving average (EMA) variable for tasks i, j and parameter
group k (e.g. the k-th layer) as:

    φ̂_{ij,k}^(t) = (1 − β)·φ̂_{ij,k}^(t−1) + β·φ_{ij,k}^(t)

But your GradVac implementation still only sets different target values between tasks, not between parameter groups. Is this reasonable?

for GNN

Hi, I know it's a bit off-topic, but I am curious whether I can apply multi-task learning to a graph neural network. From what I've learned about HPS, the encoder is shared across tasks. Should I create an encoder on top of the graph layers? I'm a bit stuck in this experiment; any suggestion would be helpful. Thanks.
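A minimal sketch of the shared-encoder idea for graphs, in plain PyTorch (one hypothetical graph-convolution step; not LibMTL code):

    import torch
    import torch.nn as nn

    class SharedGraphEncoder(nn.Module):
        # one graph-convolution step shared by all tasks
        def __init__(self, in_dim, hid_dim):
            super().__init__()
            self.lin = nn.Linear(in_dim, hid_dim)

        def forward(self, x, adj_norm):
            # x: [n, in_dim] node features; adj_norm: [n, n] normalized adjacency
            return torch.relu(adj_norm @ self.lin(x))

    encoder = SharedGraphEncoder(in_dim=16, hid_dim=32)
    decoders = {t: nn.Linear(32, 1) for t in ['task_a', 'task_b']}  # per-task heads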

Colab tutorial

The library is really interesting.

It would be very useful to have a Colaboratory notebook to test it online.

Questions about AlignMTL

Hi there, I had a hard time reproducing the AlignMTL results. I am wondering if you have encountered the same issue. Have you evaluated AlignMTL under HPS?

Thanks,
Max

Visualization tools

Hi, is there any way to visualize gradient updates, or is there a plan to implement this in LibMTL later?

DataLoader errors when I set num_workers>1

I have found someone saying that setting num_workers=0 works, but it's too slow... My system is Ubuntu.

Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f7c408fce60>
Traceback (most recent call last):
File "/home/user/miniconda3/envs/IB/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1510, in __del__
self._shutdown_workers()
File "/home/user/miniconda3/envs/IB/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1493, in _shutdown_workers
if w.is_alive():
File "/home/user/miniconda3/envs/IB/lib/python3.7/multiprocessing/process.py", line 151, in is_alive
assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

Unable to save the trained model

When I try to save the trained model (i.e., the full model) using the following command:

torch.save(model, "<path>")

It throws this error

AttributeError: Can't pickle local object 'Trainer._prepare_model.<locals>.MTLmodel'
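Because MTLmodel is defined inside a method, pickling the whole object fails. Saving only the parameters sidesteps this (a sketch):

    # state_dicts are picklable even when the class is defined locally
    torch.save(model.state_dict(), "model_weights.pt")

    # to restore, rebuild the model the same way, then:
    model.load_state_dict(torch.load("model_weights.pt"))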

Passing arguments to encoder/decoder constructors

Is it currently possible to pass arguments to initialize the encoders and decoders?
For example, given this linear encoder class:

class SimpleLinearEncoder(nn.Module):
    def __init__(self, n_features, n_hidden1, n_output):
        super(SimpleLinearEncoder, self).__init__()
        self.encoder = torch.nn.Sequential(
            torch.nn.Linear(n_features, n_hidden1),
            torch.nn.ReLU(),
            torch.nn.Linear(n_hidden1, n_output)
        )

    def forward(self, x):
        return self.encoder(x)

In order to create an encoder with 20 inputs, 10 neurons in the hidden layer, and 4 outputs, without setting them as default argument values, can we pass these parameters in kwargs? E.g.:

kwargs = {"weight_args": {"alpha": 1.5}, "arch_args": {}, "n_features": 20, "n_hidden1": 10, "n_output": 4 }

 model = Trainer(task_dict=task_dict,
                    weighting=weighting_method.__dict__[mtl_weighting_method],
                    architecture=architecture_method.__dict__[mtl_architecture],
                    encoder_class=SimpleLinearEncoder,
                    decoders=decoders,
                    rep_grad=False,
                    multi_input=True,
                    optim_param={'optim': 'sgd', 'lr': 0.005, 'weight_decay': 0.00005, 'momentum': 0.9},
                    scheduler_param={'scheduler': 'step', 'step_size': 100, 'gamma': 0.5},
                    **kwargs)

I wasn't able to get the above to work, and my naive guess is that it's because of this line in the LibMTL.architecture classes, e.g. HPS:

self.encoder = self.encoder_class()

I'm new to PyTorch so perhaps I'm missing an easy solution.

Thanks.
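One possible workaround, since HPS calls self.encoder_class() with no arguments: pre-bind the constructor arguments with functools.partial (a sketch, not an official LibMTL feature):

    from functools import partial

    # calling encoder_class() with no arguments now builds the right network
    encoder_class = partial(SimpleLinearEncoder,
                            n_features=20, n_hidden1=10, n_output=4)

    encoder = encoder_class()  # equivalent to SimpleLinearEncoder(20, 10, 4)

Passing this pre-bound encoder_class to Trainer should behave like passing a zero-argument class.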

v1.1.6 documentation

Hello, is there up-to-date documentation for v1.1.6? I see that newly added parameters such as cfg have no complete examples or introduction; the documentation is still for the previous version.

Trainer class use fails with the error "No module named 'torchvision.models.utils'"


Full stack here -
Traceback (most recent call last):
File "src/main/pipelines/train_nsfw_mtl.py", line 11, in <module>
from LibMTL import Trainer
File "/azureml-envs/azureml_8a26314e09753d45d0790003a01faf79/lib/python3.8/site-packages/LibMTL/__init__.py", line 2, in <module>
from . import model
File "/azureml-envs/azureml_8a26314e09753d45d0790003a01faf79/lib/python3.8/site-packages/LibMTL/model/__init__.py", line 1, in <module>
from LibMTL.model.resnet import resnet18
File "/azureml-envs/azureml_8a26314e09753d45d0790003a01faf79/lib/python3.8/site-packages/LibMTL/model/resnet.py", line 3, in <module>
from torchvision.models.utils import load_state_dict_from_url
ModuleNotFoundError: No module named 'torchvision.models.utils'

The issue is fixed by using 'torch.hub' instead.
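Concretely, the fix amounts to changing the import in LibMTL/model/resnet.py:

    # from torchvision.models.utils import load_state_dict_from_url  # removed in recent torchvision
    from torch.hub import load_state_dict_from_url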

saving model

After running the training script, where will the model be saved?

Decoder weights not updating?

Hi,
Thank you for putting up such a fantastic MTL experimentation library. I used it on my own datasets and everything looked good, except that when I observed the weights of the encoder/decoders during training, only the encoder weights were updated after each iteration or epoch; the decoder weights remained the same. When I freeze the encoder layers (self.model.encoder.requires_grad_(False)) but not the decoder layers, the training loss stays the same across all iterations/epochs, which confirms that the decoder weights are not updating during training. I tried the HPS architecture with EW weighting. Could you help me debug what might contribute to this issue?

How to export saved models to other formats, such as onnx, mnn, etc

Hello, first of all, thank you for your great work.
I saved the best model while running the Office example, but I want to export it as an ONNX model and use tools like Netron to inspect the model structure. How can I add code to the framework for this? Could you add an export component for various formats, such as ONNX, MNN, and TFLite?
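A minimal export sketch using PyTorch's built-in exporter (the input shape and names here are assumptions, and dict-style multi-task outputs may need flattening first):

    import torch

    dummy_input = torch.randn(1, 3, 224, 224)   # assumed input shape
    torch.onnx.export(model, dummy_input, "libmtl_model.onnx",
                      input_names=["image"], opset_version=11)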

How can the optimizer's gradient strategies be applied when tasks use non-homogeneous data and their training-set sizes differ greatly?

Question 1:
Suppose there are task_a and task_b whose training samples are non-homogeneous: task_a has 100,000 training samples and task_b has 1,000.
If I use a single dataloader to fetch data, a batch may contain only task_a samples, so an iteration trains only task_a and no multi-task gradient strategy can be applied.

Question 2:
Likewise with imbalanced non-homogeneous multi-task training data: with three tasks task_a, task_b, and task_c, a batch may contain only task_a and task_c samples and no task_b samples. If so, how should the gradient strategies be applied?
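One common workaround for both questions is to keep one dataloader per task and cycle the smaller ones, so every step sees a sub-batch from every task (a sketch, not LibMTL internals; loader_a, loader_b, loader_c, and steps_per_epoch are placeholders):

    import itertools

    loaders = {'task_a': loader_a, 'task_b': loader_b, 'task_c': loader_c}
    # cycling keeps small datasets (e.g. task_b) from running dry mid-epoch
    iters = {t: itertools.cycle(dl) for t, dl in loaders.items()}

    for step in range(steps_per_epoch):
        # one sub-batch per task, so the multi-task gradient strategies
        # always have all task losses available
        batch = {t: next(it) for t, it in iters.items()}

Note that itertools.cycle caches elements after the first pass; for very large datasets it is lighter to re-create the iterator on StopIteration instead.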

Pytorch Distributed DataParallel support

Hi, great work.

Did you try running the gradient-surgery methods with DistributedDataParallel in PyTorch? I couldn't run them: calling backward multiple times creates gradient-syncing issues in distributed training.

When running the example code for QM9, the program seems to enter an infinite loop

When running the train_qm9.py file, the program prints the following information and then becomes unresponsive, as if it has entered an infinite loop. What could be the possible causes of this issue?

General Configuration:
Weighting: EW
Architecture: HPS
Rep_Grad: False
Multi_Input: False
Seed: 0
Save Path: None
Load Path: None
Device: cuda:0
Optimizer Configuration:
optim: adam
lr: 0.0001
weight_decay: 1e-05

Total Params: 617675
Trainable Params: 617675
Non-trainable Params: 0

LOG FORMAT | 0_LOSS MAE | 1_LOSS MAE | 2_LOSS MAE | 3_LOSS MAE | 5_LOSS MAE | 6_LOSS MAE | 12_LOSS MAE | 13_LOSS MAE | 14_LOSS MAE | 15_LOSS MAE | 11_LOSS MAE | TIME

Some problems with metrics.py

When I used the L1Metric class, I got an error:

AttributeError: 'L1Metric' object has no attribute 'abs_record'

class L1Metric(AbsMetric):
    r"""Calculate the Mean Absolute Error (MAE).
    """
    def __init__(self):
        super(L1Metric, self).__init__()
        
    def update_fun(self, pred, gt):
        r"""
        """
        abs_err = torch.abs(pred - gt)
        self.record.append(abs_err)
        self.bs.append(pred.size()[0])
        
    def score_fun(self):
        r"""
        """
        records = np.array(self.abs_record)
        batch_size = np.array(self.bs)
        return [(records*batch_size).sum()/(sum(batch_size))]

The L1Metric class inherits from AbsMetric, but AbsMetric has no attribute 'abs_record', so I guess there may be a problem here; of course, this attribute might also come from somewhere else.

class AbsMetric(object):
    r"""An abstract class for the performance metrics of a task. 

    Attributes:
        record (list): A list of the metric scores in every iteration.
        bs (list): A list of the number of data in every iteration.
    """
    def __init__(self):
        self.record = []
        self.bs = []
    
    @property
    def update_fun(self, pred, gt):
        r"""Calculate the metric scores in every iteration and update :attr:`record`.

        Args:
            pred (torch.Tensor): The prediction tensor.
            gt (torch.Tensor): The ground-truth tensor.
        """
        pass
    
    @property
    def score_fun(self):
        r"""Calculate the final score (when an epoch ends).

        Return:
            list: A list of metric scores.
        """
        pass
    
    def reinit(self):
        r"""Reset :attr:`record` and :attr:`bs` (when an epoch ends).
        """
        self.record = []
        self.bs = []

Wrong condition

if epoch == 0 and self.base_result is None and (mode=='val' if self.has_val else 'test'):

should be

if epoch == 0 and self.base_result is None and mode == ('val' if self.has_val else 'test'):

(In the original, when self.has_val is False the condition evaluates the truthy string 'test' instead of comparing mode against 'test'.)

Two problems when replacing the loss function

Hello, I have two questions about replacing the loss function. Thanks!

The version currently in use is 1.1.6.

  1. When the loss function is changed from CELoss to KLDivLoss, a dimension mismatch occurs:
 File "/home/miniconda3/envs/pytorch/lib/python3.8/site-packages/LibMTL/core/trainer.py", line 461, in train
     train_losses[tn] = self._compute_loss(
   File "/home/miniconda3/envs/pytorch/lib/python3.8/site-packages/LibMTL/core/trainer.py", line 304, in _compute_loss
     train_losses = self.losses[task_name].update_loss(preds[task_name], gts)
   File "/home/miniconda3/envs/pytorch/lib/python3.8/site-packages/LibMTL/loss/abstract_loss.py", line 59, in update_loss
     loss = self.compute_loss(pred, gt)
   File "/home/miniconda3/envs/pytorch/lib/python3.8/site-packages/LibMTL/loss/KLDivLoss.py", line 19, in compute_loss
     loss = self.loss_fn(pred, gt)
   File "/home/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
     return forward_call(*input, **kwargs)
   File "/home/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 465, in forward
     return F.kl_div(input, target, reduction=self.reduction, log_target=self.log_target)
   File "/home/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/functional.py", line 2916, in kl_div
     reduced = torch.kl_div(input, target, reduction_enum, log_target=log_target)
RuntimeError: The size of tensor a (64) must match the size of tensor b (31) at non-singleton dimension 1
  2. When the original loss function is replaced with a new one, an in-place operation error occurs, although there does not seem to be any in-place operation in the code; if the loss is still cross-entropy, there is no problem.

The modified code:

decoder_soft_loss = nn.KLDivLoss(reduction="batchmean")(
                                 nn.functional.log_softmax(unlearned_decoder / 10.0, dim=1),
                                 nn.functional.softmax(init_decoder / 10.0, dim=1))

Here unlearned_decoder is the pred output by the model, and init_decoder is the pred output by the initial model.

The error message:

  File "train_office.py", line 12, in <module>
    Officemodel.kd_train()
  File "/home/miniconda3/envs/pytorch/lib/python3.8/site-packages/LibMTL/core/trainer.py", line 712, in kd_train
    w = self.model.backward(train_losses, **weighting_arg)
  File "/home/miniconda3/envs/pytorch/lib/python3.8/site-packages/LibMTL/weighting/DWA.py", line 40, in backward
    loss.backward()
  File "/home/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512, 31]], which is output 0 of AsStridedBackward0, is at version 5; expected version 4 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
