lightning-ai / torchmetrics
Torchmetrics - Machine learning metrics for distributed, scalable PyTorch applications.
Home Page: https://lightning.ai/docs/torchmetrics/
License: Apache License 2.0
Add a conda setup for testing against all PyTorch feature releases such as 1.4, 1.5, 1.6, ...
Add better validation for functions that are not supported in older PyTorch versions.
Use a CI action with a conda setup; there is probably no need to pull a large docker image.
Take inspiration from the past Conda matrix in PL.
We would like to have tighter integration of metrics and sweeping. This requires a few features:
higher_is_better (e.g. are we trying to minimize or maximize the metric in a sweep).
An alternative implementation would be for each metric to have is_better(left: TMetricResult, right: TMetricResult), where TMetricResult is whatever compute() returns.
If we don't have it, people will have to write wrappers around the metrics to support this functionality in sweepers.
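A minimal sketch of what this could look like on a custom metric; the higher_is_better attribute and the is_better() signature follow this proposal rather than an existing torchmetrics API:
import torch
from torchmetrics import Metric

class MyAccuracy(Metric):
    higher_is_better = True  # a sweeper would maximize this metric

    def __init__(self):
        super().__init__()
        self.add_state("correct", default=torch.tensor(0), dist_reduce_fx="sum")
        self.add_state("total", default=torch.tensor(0), dist_reduce_fx="sum")

    def update(self, preds, target):
        self.correct += (preds == target).sum()
        self.total += target.numel()

    def compute(self):
        return self.correct.float() / self.total

    @classmethod
    def is_better(cls, left, right):
        # True if `left` is a better result than `right`
        return left > right if cls.higher_is_better else left < right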
Allow MetricCollection to combine metrics internally to reduce redundant computations.
Many metrics currently share the same redundant computations underneath. Take Recall and Precision for example: they both calculate tp, fp, tn, fn during their update step and then use them differently during the compute step. We have chosen to do it this way to keep the API simple.
However, we could implement it so that when two metrics with the same update states are collected in a MetricCollection, only one metric is updated and its state is broadcast to the other metrics.
Keeping track of which metrics can be combined could probably be done with some kind of registry:
@metric_group(Recall, Precision, F1, FBeta)
@metric_group(MeanSquaredError, PSNR)
...
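A hypothetical sketch of such a registry; metric_group and _METRIC_GROUPS do not exist in torchmetrics and only illustrate the idea:
_METRIC_GROUPS = []

def metric_group(*metric_classes):
    """Register a group of metric classes that share the same update states."""
    group = set(metric_classes)
    _METRIC_GROUPS.append(group)

    def decorator(cls):
        # also allows use as a class decorator, adding the decorated class to the group
        group.add(cls)
        return cls

    return decorator

def shares_state(metric_a, metric_b):
    """Return True if two metric instances belong to the same registered group."""
    return any(type(metric_a) in g and type(metric_b) in g for g in _METRIC_GROUPS)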
Metrics should give an option to be computed with per-sample weights, similar to Scikit-Learn (e.g. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html).
For a lot of applications, per-sample weighting is important, so the metrics package should provide support for this.
Keras metrics provide sample_weight support as well: https://keras.io/api/metrics/classification_metrics/
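A minimal sketch of how a functional metric could accept per-sample weights; the sample_weight name follows scikit-learn and is not an existing torchmetrics argument:
import torch

def weighted_accuracy(preds: torch.Tensor, target: torch.Tensor,
                      sample_weight: torch.Tensor = None) -> torch.Tensor:
    """Accuracy with optional per-sample weights, mirroring scikit-learn's sample_weight."""
    correct = (preds == target).float()
    if sample_weight is None:
        return correct.mean()
    return (correct * sample_weight).sum() / sample_weight.sum()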
Add Hinge losses to classification package:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.hinge_loss.html#sklearn.metrics.hinge_loss
possibly with an added parameter squared to also calculate the squared hinge loss.
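A minimal functional sketch, assuming binary targets encoded as -1/+1 and raw decision scores; the squared flag follows the proposal above:
import torch

def hinge_loss(preds: torch.Tensor, target: torch.Tensor, squared: bool = False) -> torch.Tensor:
    """Mean (squared) hinge loss for targets in {-1, +1} and raw decision scores."""
    losses = torch.clamp(1 - target * preds, min=0)
    if squared:
        losses = losses ** 2
    return losses.mean()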
Add KLDivergence metric. Measures the distance between two probability distributions p(x) and q(x).
Given by (pseudo implementation):
sum(s_p * log(s_p / s_q))
where s_p are samples from the p(x) distribution and s_q are samples from the q(x) distribution.
https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
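A minimal functional sketch, assuming the inputs are probability tensors that sum to 1 along the last dimension:
import torch

def kl_divergence(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """KL(p || q) for probability tensors that sum to 1 along the last dimension."""
    p = p.clamp(min=eps)
    q = q.clamp(min=eps)
    return (p * (p / q).log()).sum(dim=-1).mean()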
I would like to request the (re-) implementation of the Cohen Kappa score and the new implementation of the Matthews Correlation Coefficient (MCC) in PyTorch Lightning's metrics.
The Cohen Kappa and MCC are often used metrics in classification tasks, especially in a medical setting to determine such things as inter-grader reliability. The Kappa score was originally implemented in PyTorch Lightning 0.9 but has disappeared for some reason. The MCC is often seen as the best metric to use in highly imbalanced datasets. The addition of these two metrics would make it more convenient to use PyTorch Lightning for medical tasks and other tasks that involve ground truth uncertainty and imbalanced data.
Implementation of the Cohen Kappa and MCC as metrics in PyTorch Lightning. Both metrics are already available in scikit-learn.
Cannot think of any.
None.
Metrics that depend on the precision-recall curve are currently implemented in a way that requires storing all of the predictions and labels in memory, making them impractical to use for large datasets or problems with large label spaces. We should support a binning-based metrics implementation to solve this. Prototype is here: https://gist.github.com/maximsch2/2b55bab6deba629a5686258cb8152e53
Don't do anything and be restricted in the scalability of metrics.
Another option for scaling is making it easier to keep metrics off-GPU.
A possible question is if we want to have both raw and binned implementations of the metrics.
Keras provides binning-based implementation by default: https://keras.io/api/metrics/classification_metrics/#auc-class
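A rough sketch of the binning idea with fixed thresholds; this is only an illustration of the approach, not the implementation from the linked gist:
import torch

class BinnedPrecisionRecall:
    """Accumulate TP/FP/FN counts at fixed thresholds instead of storing all predictions."""

    def __init__(self, num_thresholds: int = 100):
        self.thresholds = torch.linspace(0, 1, num_thresholds)
        self.tp = torch.zeros(num_thresholds)
        self.fp = torch.zeros(num_thresholds)
        self.fn = torch.zeros(num_thresholds)

    def update(self, preds: torch.Tensor, target: torch.Tensor) -> None:
        # preds: probabilities in [0, 1], target: binary labels in {0, 1}
        pred_pos = preds.unsqueeze(0) >= self.thresholds.unsqueeze(1)  # (T, N)
        target_pos = target.bool().unsqueeze(0)
        self.tp += (pred_pos & target_pos).sum(dim=1)
        self.fp += (pred_pos & ~target_pos).sum(dim=1)
        self.fn += (~pred_pos & target_pos).sum(dim=1)

    def compute(self):
        precision = self.tp / (self.tp + self.fp).clamp(min=1)
        recall = self.tp / (self.tp + self.fn).clamp(min=1)
        return precision, recall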
Currently we only test the metrics on single CPU and distributed CPU. While we have had no explicit issues that link back to the metrics not being tested on GPU, we should do it anyway.
I use the commit f06488f to calculate RetrievalMAP and RetrievalPrecision in a pytorch-lightning module. The validation_step and validation_step_end functions work, but fast_dev_run=True gives an error in the compute() step.
However, when I run self.log("val_MAP", self.metric.compute()) instead of self.log("val_MAP", self.metric) in validation_step_end, I do not get errors. But computing the whole metric on every validation_step becomes very slow.
Steps to reproduce the behavior:
I run the following code with the mentioned commit.
from typing import Optional
import os
import torch
from torch import nn
import torch.nn.functional as F
from torchvision import transforms
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader, random_split
import pytorch_lightning as pl
from torchmetrics import (
RetrievalMAP,
RetrievalPrecision,
MeanAbsoluteError,
)
class MNISTDataModule(pl.LightningDataModule):
    def __init__(self, batch_size=32):
        super().__init__()
        self.batch_size = batch_size

    def prepare_data(self):
        MNIST(os.getcwd(), train=True, download=True)
        MNIST(os.getcwd(), train=False, download=True)

    def setup(self, stage: Optional[str] = None):
        transform = transforms.Compose(
            [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
        )
        if stage == "fit":
            mnist_train = MNIST(os.getcwd(), train=True, transform=transform)
            self.mnist_train, self.mnist_val = random_split(
                mnist_train, [55000, 5000]
            )
        if stage == "test":
            self.mnist_test = MNIST(
                os.getcwd(), train=False, transform=transform
            )

    def train_dataloader(self):
        mnist_train = DataLoader(self.mnist_train, batch_size=self.batch_size)
        return mnist_train

    def val_dataloader(self):
        mnist_val = DataLoader(self.mnist_val, batch_size=self.batch_size)
        return mnist_val

    def test_dataloader(self):
        mnist_test = DataLoader(self.mnist_test, batch_size=self.batch_size)
        return mnist_test


class LitAutoEncoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 1), nn.Softplus()
        )
        self.decoder = nn.Sequential(
            nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 28 * 28)
        )
        # self.metric = RetrievalMAP()
        self.metric = RetrievalPrecision()
        # self.metric = MeanAbsoluteError()

    def forward(self, x):
        embedding = self.encoder(x)
        return embedding

    def training_step(self, batch, batch_idx):
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        x = x.view(x.size(0), -1)
        preds = self.encoder(x).squeeze()
        indexes = torch.randint(100, size=preds.size())
        targets = torch.randint(2, size=preds.size()).to(bool)
        return {"indexes": indexes, "preds": preds, "targets": targets}

    def validation_step_end(self, outputs):
        self.metric(outputs["indexes"], outputs["preds"], outputs["targets"])
        # self.metric(outputs["preds"], outputs["preds"] ** 2)
        self.log("val_MAP", self.metric)
        # self.log("val_MAP", self.metric.compute())

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer


if __name__ == "__main__":
    datamodule = MNISTDataModule()
    module = LitAutoEncoder()
    trainer = pl.Trainer(gpus=1, fast_dev_run=True)
    trainer.fit(module, datamodule=datamodule)
    trainer.test(module, datamodule=datamodule)
Traceback (most recent call last):
File "reproduce_retrieval_error.py", line 107, in <module>
trainer.fit(module, datamodule=datamodule)
File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
self.dispatch()
File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
self.accelerator.start_training(self)
File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
self.training_type_plugin.start_training(trainer)
File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
self._results = trainer.run_train()
File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 637, in run_train
self.train_loop.run_training_epoch()
File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 577, in run_training_epoch
self.trainer.run_evaluation(on_epoch=True)
File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 754, in run_evaluation
eval_loop_results = self.evaluation_loop.log_epoch_metrics_on_evaluation_end()
File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 200, in log_epoch_metrics_on_evaluation_end
eval_loop_results = self.trainer.logger_connector.get_evaluate_epoch_results()
File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py", line 286, in get_evaluate_epoch_results
metrics_to_log = self.cached_results.get_epoch_log_metrics()
File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 405, in get_epoch_log_metrics
return self.run_epoch_by_func_name("get_epoch_log_metrics")
File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 398, in run_epoch_by_func_name
results = [func() for func in results]
File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 398, in <listcomp>
results = [func() for func in results]
File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 128, in get_epoch_log_metrics
return self.get_epoch_from_func_name("get_epoch_log_metrics")
File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 121, in get_epoch_from_func_name
self.run_epoch_func(results, opt_metrics, func_name, *args, **kwargs)
File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 110, in run_epoch_func
metrics_to_log = func(*args, add_dataloader_idx=self.has_several_dataloaders, **kwargs)
File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/pytorch_lightning/core/step_result.py", line 327, in get_epoch_log_metrics
result[dl_key] = self[k].compute().detach()
File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/torchmetrics/metric.py", line 228, in wrapped_func
self._computed = compute(*args, **kwargs)
File "/home/netter/.cache/pypoetry/virtualenvs/stereographic-link-prediction-ra14Y8Aq-py3.8/lib/python3.8/site-packages/torchmetrics/retrieval/retrieval_metric.py", line 110, in compute
idx = torch.cat(self.idx, dim=0)
RuntimeError: There were no tensor arguments to this function (e.g., you passed an empty list of Tensors), but no fallback function is registered for schema aten::_cat. This usually means that this function requires a non-empty list of Tensors. Available functions are [CPU, CUDA, QuantizedCPU, BackendSelect, Named, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradNestedTensor, UNKNOWN_TENSOR_TYPE_ID, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, Autocast, Batched, VmapMode].
CPU: registered at /pytorch/build/aten/src/ATen/RegisterCPU.cpp:5925 [kernel]
CUDA: registered at /pytorch/build/aten/src/ATen/RegisterCUDA.cpp:7100 [kernel]
QuantizedCPU: registered at /pytorch/build/aten/src/ATen/RegisterQuantizedCPU.cpp:641 [kernel]
BackendSelect: fallthrough registered at /pytorch/aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Named: registered at /pytorch/aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
AutogradOther: registered at /pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradCPU: registered at /pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradCUDA: registered at /pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradXLA: registered at /pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradNestedTensor: registered at /pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
UNKNOWN_TENSOR_TYPE_ID: registered at /pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradPrivateUse1: registered at /pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradPrivateUse2: registered at /pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradPrivateUse3: registered at /pytorch/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
Tracer: registered at /pytorch/torch/csrc/autograd/generated/TraceType_2.cpp:10525 [kernel]
Autocast: registered at /pytorch/aten/src/ATen/autocast_mode.cpp:254 [kernel]
Batched: registered at /pytorch/aten/src/ATen/BatchingRegistrations.cpp:1016 [backend fallback]
VmapMode: fallthrough registered at /pytorch/aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
This error should not show up; I expect the metric to be computed correctly. When I use MeanAbsoluteError as the metric, the code works. Therefore, there must be a bug in the compute step of the Retrieval Metrics in combination with pytorch-lightning's API, as a call to compute() within validation_step_end does not create errors.
How you installed PyTorch (conda, pip, source): pip / poetry
Add Fréchet inception distance (FID) metric.
Standard measure for image quality of generative models.
Originally proposed here:
https://arxiv.org/abs/1706.08500
Other links:
https://en.wikipedia.org/wiki/Fr%C3%A9chet_inception_distance
https://machinelearningmastery.com/how-to-implement-the-frechet-inception-distance-fid-from-scratch/
In addition to Precision and Recall it would be nice to have a Specificity metric.
For the implementation I think it would be enough to make a copy of Recall (class and function) and adapt the numerator and denominator in _precision_compute.
For binary classification, Specificity is the same as Recall with 0 treated as the positive label.
For multiclass classification it is not as straightforward, though.
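A minimal binary sketch from the confusion-matrix counts (illustrative helper, not the proposed class API):
import torch

def binary_specificity(preds: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Specificity = TN / (TN + FP) for binary predictions and targets in {0, 1}."""
    preds, target = preds.bool(), target.bool()
    tn = (~preds & ~target).sum()
    fp = (preds & ~target).sum()
    return tn.float() / (tn + fp).clamp(min=1)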
Similarly to issue #100, it would be nice to be able to make roc work with multi-label inputs.
auc, and hence _auroc_compute, do work with multi-label inputs and return an AUROC value for each label/class by iterating over range(num_classes) when passing average=None.
_roc_compute, and hence roc, differentiate only between binary and multi-class (by checking if num_classes == 1).
I would expect _roc_update to similarly detect the mode using _input_format_classification(preds, target) and roc to return a list of [fpr, tpr, threshold] of length num_classes.
The easiest would be the format [[fpr, tpr, thres]]*5
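A rough sketch of the requested behaviour, built on the existing binary roc applied per label column; preds and target of shape (N, num_labels) are assumed:
import torch
from torchmetrics.functional import roc

def multilabel_roc(preds: torch.Tensor, target: torch.Tensor):
    """Return a list of (fpr, tpr, thresholds) tuples, one per label column."""
    num_labels = preds.shape[1]
    return [roc(preds[:, i], target[:, i], pos_label=1) for i in range(num_labels)]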
Publish package also to Conda distribution
Allow user to install from any source
you can check the documentation at https://conda-forge.org/docs/maintainer/adding_pkgs.html. It's actually very easy. In short you must submit a PR to https://github.com/conda-forge/staged-recipes. Once the CI is green you can ping conda-forge folks and they will review it. Once done, the feedstock will be created and your package built and uploaded to conda-forge.
Add Cosine similarity metric
https://en.wikipedia.org/wiki/Cosine_similarity
Measures the similarity between two feature vectors by calculating the angle between them.
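A minimal functional sketch, leaning on torch.nn.functional.cosine_similarity for the per-row computation:
import torch
import torch.nn.functional as F

def cosine_similarity(preds: torch.Tensor, target: torch.Tensor, reduction: str = "mean") -> torch.Tensor:
    """Cosine similarity between corresponding rows of two (N, D) feature tensors."""
    sim = F.cosine_similarity(preds, target, dim=1)
    return sim.mean() if reduction == "mean" else sim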
All built-in metrics follow a very fixed structure:
- Create a new_metric.py file and place it in the functional folder.
- Implement the _new_metric_update, _new_metric_compute and new_metric functions.
- Create a new_metric.py file in the appropriate class-based folder.
- Implement the class version on top of the base Metric class.
- Add a MetricTester class object for testing.
This should be clear from the already implemented metrics, but could be made very explicit in the contribution guidelines.
I implemented my own Metric class whose compute returns a data class with some aggregated metrics -- precision, recall, and f1-score. But when I try to call the metric inside *_step I get an error from PyTorch internals.
The error happens in this line. If I call the validation metric (initialized with compute_on_step=False) during validation_step I get:
TypeError: 'NoneType' object is not subscriptable
In the case of the training metric during training_step:
TypeError: 'ClassificationMetrics' object is not subscriptable
ClassificationMetrics is the name of my data class.
I also tried to return a float from compute, but it causes the same error. I assume that PyTorch expects to receive a tensor and is therefore trying to index into the returned value. An obvious solution is to return a tensor from compute, but that doesn't fix calling a validation metric that doesn't return anything.
How you installed PyTorch (conda, pip, source): pip
Offer a dedicated sync() interface on the base Metric class. This would consolidate state across a provided process group using a given dist_sync_fn and would let us deprecate the dist_sync_on_step flag on the metric constructor.
The reason we'd like this is to decouple metric computation and global syncing. As a result, we'd be able to inspect the local metric state separately from the synced state.
Example scenario:
This interface also enables the training framework to offer higher-level APIs that could automatically call sync() for a particular Metric at relevant spots in the training loop (e.g. on_step, or on_epoch in Lightning).
cc @maximsch2
We should be able to re-use most of _sync_dist() already.
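A rough sketch of how the proposed interface could be used; sync() and unsync() are the names suggested here, not a finalized API:
import torch
from torchmetrics import Accuracy

metric = Accuracy()
metric.update(torch.tensor([0, 1, 1]), torch.tensor([0, 1, 0]))  # accumulate local state
local_value = metric.compute()   # per-process value, no cross-process communication

metric.sync()                    # hypothetical: consolidate state across the process group
global_value = metric.compute()  # value computed on the consolidated state
metric.unsync()                  # hypothetical: restore the local, per-process state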
Keep as is
MinMaxMetric is a metric that simply wraps another metric (e.g. val_acc) and creates a new metric that tracks the min, max or both values of val_acc. This makes it possible, for example, to see the max_val_acc of a complete experiment in TensorBoard (instead of going through the graph manually to find the max value), but I can see other use cases as well. MaxMetric code is implemented here.
We should provide the ability to compute bootstrapped confidence intervals for metrics.
Confidence intervals are important and we should make it easy for people to increase rigor of their research and model evaluations.
I'm thinking we can have something like this (very high level):
class Bootstrapper(Metric):
    def __init__(self, num_samples, metric):
        super().__init__()
        self.num_samples = num_samples
        self.metrics = nn.ModuleList([deepcopy(metric) for _ in range(num_samples)])

    def update(self, preds, targets):
        for idx in range(self.num_samples):
            preds_sampled, targets_sampled = sample_for_bootstrap(preds, targets)
            self.metrics[idx].update(preds_sampled, targets_sampled)
which will let people wrap any metric and have a set of internal copies of the metric updated with different samples of the data, giving us the ability to get a distribution of metric values.
We can skip it on the class-based metrics side and assume anyone doing bootstrap will load everything in memory and do bootstrap using functional metrics.
I have been using my own version of MetricList in my personal workflow for some time now and it has proven to be very helpful in keeping code clean.
A MetricList wraps multiple metrics together and puts them on the proper devices (much like a ModuleList). What makes it different is that it also allows you to update or compute() all of them in one statement and log all of them using one log() call.
My dynamic inference model needs its val_acc tested in 32 different setups. Manually creating all the different Accuracy() metrics is ridiculous. ModuleList() helps to create them in batch, but I still need to write helper functions to log() or compute() all of them separately.
See pitch.
One common pattern I've seen copy-pasted across many different projects is a generic AverageMeter, which tracks the running average of a quantity.
This isn't strictly a "metric", but I'm wondering whether you'd be open to having an implementation in this metrics repository -- it's quite common and having it in a centralized place could be helpful. If you are open to it, I'd be happy to contribute an implementation.
import numpy as np
import torch

class AverageMeter(object):
    """Computes and stores the average and current value"""

    def __init__(self, name, fmt=':f'):
        self.name = name
        self.fmt = fmt
        self.reset()

    def reset(self):
        self.val = 0.
        self.avg = 0.
        self.sum = 0.
        self.count = 0

    def update(self, val, n=1):
        # accept python numbers, numpy arrays and torch tensors
        self.val = float(val.item()) if isinstance(val, (np.ndarray, torch.Tensor)) else val
        self.sum += self.val * n
        self.count += n
        self.avg = self.sum / self.count

    def __str__(self):
        fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
        return fmtstr.format(**self.__dict__)
Following the sample code at https://github.com/PyTorchLightning/metrics, we use:
metric = torchmetrics.ROC()
model.roc_metric = metric
In the test epoch we call metric.update(output, target), and after the test epoch we run metric.compute().
The compute() call hangs the training process and leaves two GPUs at 100% usage.
By the way, using the metric code from pytorch_lightning has the same issue as the standalone package.
PyTorch Version (1.7.0+cu101):
OS (Linux ubuntu 18.04):
How you installed PyTorch (pip):
torchmetrics Version: 0.2.0
Python version: 3.6
CUDA/cuDNN version: 10.1/7.6.5.32-1+cuda10.1
GPU models and configuration: two 2080ti
Add a property from which the user can determine whether a metric is differentiable or not:
@property
def is_differentiable(self):
    return True  # or False
and add appropriate tests. We can take inspiration from what kornia is doing:
https://github.com/kornia/kornia/blob/master/test/color/test_gray.py#L69
Some metrics support differentiability, some do not. It would be great if we were more explicit about it and actually had tests for it.
Presently, when using the Accuracy metric on multi-class inputs with scores (the (N, C) entry in input types), the scores are required to be probabilities in [0, 1].
However, un-thresholded accuracy can be computed without normalized probabilities as inputs, as the relative ordering of scores is all that is needed.
Given that some uses of Accuracy do require normalized probabilities, we could implement this as a flag that would disable the input check.
It is common to work with unnormalized class scores during training, especially in classification tasks, as they are used in the more stable nn.CrossEntropyLoss. Rather than having to additionally compute a softmax just for the accuracy metric, it would be reasonable to allow usage of arbitrarily scaled input data.
I specify Accuracy because it is the use case that I ran into, but it's possible other Metrics have the same property.
Add a flag to Accuracy (and any other applicable metrics) that disables the input range check for preds.
The present workaround is to apply a softmax before feeding data to your Accuracy metric.
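A minimal illustration of that workaround; logits and target are placeholder tensors standing in for real model outputs and labels:
import torch
from torchmetrics import Accuracy

accuracy = Accuracy()
logits = torch.randn(8, 10)                      # raw, unnormalized class scores
target = torch.randint(10, (8,))
acc = accuracy(logits.softmax(dim=-1), target)   # softmax only to satisfy the [0, 1] input check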
In addition to Precision and Recall it would be nice to have a Negative Predictive Value metric.
For the implementation I think it would be enough to make a copy of Precision (class and function) and adapt the numerator and denominator in _precision_compute.
For binary classification, Negative Predictive Value is the same as Precision with 0 treated as the positive label.
For multiclass classification it is not as straightforward, though.
Add deviance scores, for use in regression tasks and goodness-of-fit testing.
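As one concrete example, a minimal sketch of the mean Poisson deviance (same formula as scikit-learn's mean_poisson_deviance); strictly positive preds and target are assumed:
import torch

def mean_poisson_deviance(preds: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean Poisson deviance: 2 * mean(y * log(y / y_hat) - y + y_hat), for y, y_hat > 0."""
    return 2 * (target * torch.log(target / preds) - target + preds).mean()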
Add a property to Metric which can be checked to see if it can be logged or not, or better, what the computed shape will be.
So far all Metrics in PL v1.0.x compute a scalar. The recommended way therefore is to call:
metric(predictions, targets)
self.log("some_name", metric)
which has worked up until now.
However, with upcoming metrics like ConfusionMatrix, the computed value returned is not necessarily a scalar, which will result in a ValueError when trying to log it.
If you have multiple metrics, the code-efficient approach would be to loop over the metrics, e.g.:
for m in self.metrics:
self.log("metric_name", m)
Adding a Metric that does not return a scalar will break this code.
These are some ideas, but probably there is something better:
- A shape property on Metric that returns, e.g., (1, ) or 1 for scalar metrics, so users can check before logging.
- Updating self.log() to deal with non-scalar Metrics.
This is a solution that would likely work in most cases, except if on-step compute is turned off:
val = metric(predictions, targets)
if val.numel() == 1:  # only scalars
    self.log("some_name", metric)
Related discussion on the PyTorch Lightning forums:
https://forums.pytorchlightning.ai/t/logging-a-tensor/320
It would be nice if the Accuracy metric accepted ignore_index, as it works for StatScores or torch.nn.CrossEntropyLoss.
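A minimal illustration of the requested behaviour, masking out ignored positions before computing accuracy (hypothetical helper, not an existing API):
import torch

def accuracy_with_ignore(preds: torch.Tensor, target: torch.Tensor, ignore_index: int = -100) -> torch.Tensor:
    """Accuracy over class predictions, skipping positions where target == ignore_index."""
    keep = target != ignore_index
    return (preds[keep] == target[keep]).float().mean()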
Refactor Metric.forward() to call update only once.
The update() method in Metric gets called twice in forward() in case compute_on_step is True.
This means repeated computation, which can slow down execution. For example, I have a custom SmoothL1Metric whose update function calculates an element-wise L1 distance (see below). The problem arises when the tensors on which the metric is computed have many dimensions and the computation itself is slow.
class SmoothL1Metric(Metric):
    def __init__(self, mask_dim, dist_sync_on_step: bool = False, compute_on_step: bool = True):
        super().__init__(dist_sync_on_step=dist_sync_on_step, compute_on_step=compute_on_step)
        self.loss = torch.nn.SmoothL1Loss(reduction="sum")
        self.mask_dim = mask_dim
        self.add_state("sum", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("numel", default=torch.tensor(0.0), dist_reduce_fx="sum")

    def update(self, input, target, lens):
        mask = get_mask(input, lens, self.mask_dim).type(input.dtype)
        # this is a heavy computation that should not be executed twice
        self.sum += self.loss(input * mask, target * mask)
        self.numel += mask.sum()

    def compute(self):
        return self.sum / self.numel
How about something like:
def forward(self, *args, **kwargs):
    if self.compute_on_step:
        self._to_sync = self.dist_sync_on_step
        # save context before switch
        cache = {attr: getattr(self, attr) for attr in self._defaults.keys()}
        # call reset, update, compute, on single batch
        self.reset()
        self.update(*args, **kwargs)
        self._forward_cache = self.compute()
        # merge new and old context without recomputing update
        for attr, val in cache.items():
            setattr(self, attr, self._reductions[attr](val, getattr(self, attr)))
    else:
        with torch.no_grad():
            self.update(*args, **kwargs)
        self._forward_cache = None
    return self._forward_cache
The code probably does not work now, but the idea should be clear. What do you think?
When updating metrics that are composed of other metrics, there are two ways of dealing with updating too many times:
I don't think there is a clean way of only updating the necessary metrics in the general case (when you're just updating all the metrics yourself), but I think that when you combine your metrics in a collection, it could be useful to only update the "base" metrics, instead of all metrics.
I often want to use a base metric multiple times, and then I have to be careful not to update too many of them. A somewhat convoluted example (because the f1 score is already implemented):
prec = Precision()
recall = Recall()
f1 = 2 * (prec * recall) / (prec + recall)
prec.update(pred, gt)
recall.update(pred, gt)
f1.update(pred, gt) # Shouldn't do this, because it updates prec and recall twice.
Continuing last example :
collection = MetricCollection([prec, recall, f1])
collection.update(pred, gt)
This should only update prec and recall once.
The alternative is to always define metrics from scratch, but this causes duplication of computation during the update phase.
You still need to enable some external integrations such as:
The main metric for object detection tasks is the Mean Average Precision, implemented in PyTorch and computed on GPU.
It would be nice to add it to the collection of metrics.
An example implementation using numpy:
Implement Panoptic Quality
Allow a user to define a new metric that takes an item out of another metric.
Basically:
iou = IoU(num_classes=2, reduction="none")
fg_iou = iou[0]
bg_iou = iou[1]
There are multiple metrics (like IoU and confusion matrix) that would benefit from the use of such a feature, and it is close to the mechanism of metric arithmetic.
This would only require defining:
class Metric:
    ...
    def __getitem__(self, idx):
        return CompositionalMetric(lambda x: x[idx], self, None)
The straightforward alternative is to use CompositionalMetric directly.
Some metrics return NaN for unused classes; I would say it should rather be 0.
It would be better if current metrics like Accuracy/Recall supported a mask.
For example, when I deal with a Sequence Labeling Task and pad some sequences to max-length, I do not want to calculate metrics at the padding locations.
I guess a simple manipulation would work for accuracy (here is the original one):
from typing import Any, Optional
import torch
from pytorch_lightning.metrics.functional.classification import (
accuracy,
)
from pytorch_lightning.metrics.metric import TensorMetric
class MaskedAccuracy(TensorMetric):
    """
    Computes the accuracy classification score

    Example:
        >>> pred = torch.tensor([0, 1, 2, 3])
        >>> target = torch.tensor([0, 1, 2, 2])
        >>> mask = torch.tensor([1, 1, 1, 0])
        >>> metric = MaskedAccuracy(num_classes=4)
        >>> metric(pred, target, mask)
        tensor(1.)
    """

    def __init__(
        self,
        num_classes: Optional[int] = None,
        reduction: str = 'elementwise_mean',
        reduce_group: Any = None,
        reduce_op: Any = None,
    ):
        """
        Args:
            num_classes: number of classes
            reduction: a method for reducing accuracies over labels (default: takes the mean)
                Available reduction methods:
                - elementwise_mean: takes the mean
                - none: pass array
                - sum: add elements
            reduce_group: the process group to reduce metric results from DDP
            reduce_op: the operation to perform for ddp reduction
        """
        super().__init__(name='accuracy',
                         reduce_group=reduce_group,
                         reduce_op=reduce_op)
        self.num_classes = num_classes
        self.reduction = reduction

    def forward(self, pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        """
        Actual metric computation

        Args:
            pred: predicted labels
            target: ground truth labels
            mask: only calculate metrics where mask == 1

        Return:
            A Tensor with the classification score.
        """
        mask_fill = (1 - mask).bool()
        pred = pred.masked_fill_(mask=mask_fill, value=-1)
        target = target.masked_fill_(mask=mask_fill, value=-1)
        return accuracy(pred=pred, target=target,
                        num_classes=self.num_classes, reduction=self.reduction)
It looks like some metrics, such as the Precision-Recall curve, don't work on CPU when using float16, perhaps due to a missing feature in PyTorch?
https://colab.research.google.com/drive/1xDv043rRi5WBshP4m5aoxTt2ChlfxjIk?usp=sharing
The metrics should work in half precision on CPU as well.
Add 'micro' to list of allowed averages.
For now, 'average' must be in [None, 'macro', 'weighted'].
For multi-label classification it would be nice to allow micro averaging.
This comes down to calculating auroc(preds.flatten(), target.flatten()).
Similarly to https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html.
One could also consider adding 'samples' to the list of accepted averages.
What do you think?
Requires just a little modification to remove lightning as a dependency (it will still be used for testing)
Implement the average argument like in Precision and Recall, such that the accuracy metric can return the metric per class label.
Sometimes it may be beneficial to look at the accuracy per label, especially when working with very unbalanced datasets.
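A minimal functional sketch of the requested average=None behaviour, assuming integer class predictions and targets (illustrative helper, not an existing API):
import torch

def per_class_accuracy(preds: torch.Tensor, target: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Fraction of correct predictions for each class label (average=None behaviour)."""
    correct = torch.zeros(num_classes)
    total = torch.zeros(num_classes)
    for c in range(num_classes):
        sel = target == c
        total[c] = sel.sum()
        correct[c] = (preds[sel] == c).sum()
    return correct / total.clamp(min=1)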
I'd like to propose updating the class metrics interface of Precision/Recall/Fbeta to have the average argument include none and weighted, as in the corresponding functional metrics interface.
The current interface restricts average to macro and micro, and because of that one cannot use the class metrics interface to calculate precision/recall/fbeta for an individual class. For example, in binary classification one is typically interested in getting metric results for the positive class (class 1), and this cannot be done with the current class interface. Therefore one has to go back to the functional metric, which could defeat the purpose of having class metrics (to take care of ddp sync).
On the contrary, sklearn defaults to calculating precision/recall/fbeta for the individual class (class 1) while giving one the option to calculate the micro/macro/weighted average of these scores.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html
Update the class metrics interface of Precision/Recall/Fbeta to have the average argument include none and weighted, as in the corresponding functional metrics interface.
One can always fall back to the functional metric, but I assume this is not what we would like.
Really like the new class interface to work with DDP and appreciate all your work!
The MetricCollection documentation mentions using self.log_dict(self.train_metrics, on_step=True, on_epoch=False, prefix='train'). The prefix parameter doesn't seem to be present in the log_dict function header.
prefix is most likely usable in this context, so this feature should be implemented. If not, the documentation should be fixed.
Let's have a formal system of task types, things like BinaryClassificationTask, MultiClassClassificationTask, MultilabelClassificationTask, etc.
Add a type hierarchy of possible task types. Each task is defined by the type signature of the (predictions, labels) tuple and the semantics inside it (e.g. multiclass and multilabel have the same shape, but different semantics).
Then, each metric takes a task_type and can assume that predictions/labels conform to it. If we want to add checking at run time, each type can provide a class method (e.g. BinaryClassificationTask.validate_input) that can be enabled for checking on an opt-in basis.
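A rough sketch of what such a hierarchy could look like; the class and method names follow this proposal and are not an existing torchmetrics API:
import torch

class ClassificationTask:
    """Base class: a task defines the expected (preds, target) signature and semantics."""

    @classmethod
    def validate_input(cls, preds: torch.Tensor, target: torch.Tensor) -> None:
        raise NotImplementedError

class BinaryClassificationTask(ClassificationTask):
    @classmethod
    def validate_input(cls, preds: torch.Tensor, target: torch.Tensor) -> None:
        # preds: (N,) scores or probabilities, target: (N,) labels in {0, 1}
        if preds.shape != target.shape or not ((target == 0) | (target == 1)).all():
            raise ValueError("Invalid input for a binary classification task")

class MultilabelClassificationTask(ClassificationTask):
    @classmethod
    def validate_input(cls, preds: torch.Tensor, target: torch.Tensor) -> None:
        # preds and target: (N, num_labels); same shape as multiclass scores, different semantics
        if preds.ndim != 2 or preds.shape != target.shape:
            raise ValueError("Invalid input for a multilabel classification task")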
Implement ROUGE
I am trying to analyze a model that has multi-label predictions. When creating a confusion matrix with the functional confusion_matrix method, I get a much different result than expected. I may be misunderstanding how this is supposed to work, so any help would be appreciated!
Steps to reproduce the behavior:
1. Create predictions with torch.sigmoid applied to the output (N, C), and matching-shape truth data.
2. Run the confusion_matrix method on the data:
>>> from torchmetrics.functional import confusion_matrix
>>> import torch
>>> x = torch.tensor([[.4,.5,.6,.7],[.3,.4,.7,.1]])
>>> y = torch.tensor([[0,0,0,1],[0,1,0,0]], dtype=torch.int32)
>>> cm = confusion_matrix(x, y, num_classes=4, normalize='none')
tensor([[3., 3., 0., 0.],
[1., 1., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]])
I would expect the confusion matrix to count the classes that were predicted for each true class. I may be wrong
tensor([[0, 0, 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 0],
[0, 1, 1, 1]])
How you installed PyTorch (conda, pip, source): conda
Thanks for the great project and help!!
Support other types for classification metrics. I.e non-softmaxed network outputs.
For outputs of categorical classification, it does not matter if the output is softmaxed or not. The argmax of these tensors is the same. Can we support those by simply taking an argmax even if the values are out of the 0-1 range?
cc @SkafteNicki on whether we want to support this.
The return values of f1 and precision are wrong.
from pytorch_lightning.metrics.functional import *
y_pred = torch.Tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
y_true = torch.Tensor([0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
tp, fp, tn, fn, _ = stat_scores(y_pred, y_true, 1) #tp, fp, tn, fn = [8, 8, 0, 0], if 0 is positive.
p = precision(y_pred, y_true, 2) # it return 0.5; tp/(tp+fp) = 0.5; if take 1 as postive, precision should be 0.
r = recall(y_pred, y_true, 2) # it return 0.5; but tp/(tp+fn) should be 1.
f1_score = f1(y_pred, y_true, 2) # returns 0; which is not right too.
As mentioned above, if we take 0 as the positive class, then tp, fp, tn, fn = [8, 8, 0, 0], precision will be 0.5 and recall should be 1. But the recall() method returns 0.5.
The values should be consistent. And a parameter that lets the user mark any class as positive (like sklearn) would make it easier to use.
- python = 3.8.5
- pytorch-lightning = 1.1.6
- pytorch = 1.7
Improve test utilities to accept metrics with a variable number of arguments (at the moment only 2 args are allowed).
At the moment the test utilities accept only 2 input arguments: preds and target. Some metrics, like RetrievalMAP and RetrievalMRR, require a different number of arguments.