Comments (9)
Hi, sorry for the delay.
Is this metric similar to reciprocal rank, except that you average over the top-k positions instead of considering only the first relevant retrieved item?
Could you please provide an example?
from ranx.
See below an example from pyxclib:
```python
def psprecision(X, true_labels, inv_psp, k=5, sorted=False, use_cython=False):
    """
    Compute propensity scored precision@k for 1-k

    Arguments:
    ----------
    X: csr_matrix, np.ndarray or dict
        * csr_matrix: csr_matrix with nnz at relevant places
        * np.ndarray (float): scores for each label
          (user must ensure the shape is fine)
        * np.ndarray (int): top indices (in sorted order)
          (user must ensure the shape is fine)
        * {'indices': np.ndarray, 'scores': np.ndarray}
    true_labels: csr_matrix or np.ndarray
        ground truth in sparse or dense format
    inv_psp: np.ndarray
        inverse propensity scores for each label
    k: int, optional (default=5)
        compute propensity scored precision up to k
    sorted: boolean, optional (default=False)
        whether X is already sorted (will skip sorting)
        * used when X is of type dict or np.ndarray (of indices)
        * the shape is not checked when X is an np.ndarray
        * must be set to True when X is an np.ndarray of indices
    use_cython: boolean, optional (default=False)
        whether to use the cython version to find the top-k elements
        * defaults to the numba version
        * may be useful when the numba version fails on a system

    Returns:
    -------
    np.ndarray: propensity scored precision values for 1-k
    """
    # Top-k indices of the predictions and of the ideal (propensity-sorted)
    # ranking used as the normalizer.
    indices, true_labels, ps_indices, inv_psp = _setup_metric(
        X, true_labels, inv_psp, k=k, sorted=sorted, use_cython=use_cython)
    eval_flags = _eval_flags(indices, true_labels, inv_psp)
    ps_eval_flags = _eval_flags(ps_indices, true_labels, inv_psp)
    # Normalize by the best attainable propensity-scored precision.
    return _precision(eval_flags, k) / _precision(ps_eval_flags, k)
```
from ranx.
A common characteristic of eXtreme Multi-label Text Classification (XMTC) is the long-tailed distribution of its huge label space. It is therefore recommended that XMTC methods also be evaluated with propensity-scored metrics such as PSP@k (propensity-scored precision at k) and PSnDCG@k (propensity-scored nDCG at k), as described in "Propensity-scored Performance at the Top".
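For reference, the unnormalized form of PSP@k amounts to a propensity-weighted precision. A minimal self-contained sketch (the function name and the toy data are illustrative, not from any library):

```python
def psp_at_k(ranked_labels, relevant, propensities, k=5):
    """Unnormalized propensity-scored precision at k (illustrative sketch).

    ranked_labels: label ids in predicted ranking order
    relevant:      set of ground-truth label ids
    propensities:  dict mapping label id -> propensity p_l in (0, 1]
    """
    # Each relevant label in the top k contributes 1 / p_l instead of 1,
    # so rare (tail) labels weigh more than frequent (head) ones.
    return sum(1.0 / propensities[l] for l in ranked_labels[:k] if l in relevant) / k

# Toy example: label 7 is a rare tail label, label 1 a frequent head one.
propensities = {1: 0.9, 3: 0.5, 7: 0.1}
print(psp_at_k([7, 1, 3], relevant={3, 7}, propensities=propensities, k=3))
# (1/0.1 + 1/0.5) / 3 = 4.0
```

pyxclib's psprecision above additionally divides this quantity by its value for the best attainable ranking, so its reported scores are normalized by that maximum.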
from ranx.
> ranx is for evaluating ranking tasks, not classification ones. I prefer to keep it this way for the moment. Therefore, I am not going to add the requested metric.

Even as a classification task, XMTC is often approached with information retrieval methods: since there are millions of labels, the common approach is to retrieve a set of candidate labels and then rank them. XMTC is therefore a ranking task, and several papers employ MRR and nDCG as evaluation metrics.
However, the labels follow a long-tailed distribution, so it is important to weight them according to their frequencies. It would be great if you could reconsider and also publish the propensity-scored ranking metrics, as demonstrated below.
Hello @AmenRa, what do you think? I think it is not too hard: basically, we have to pass each document's propensity (as a simple list of weights) to the desired metric. I could try to integrate it if you help me.
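For illustration, the proposed usage might look something like this (only Qrels, Run, and evaluate are real ranx API; the `psp@5` metric name and the `propensities` parameter are hypothetical):

```python
from ranx import Qrels, Run, evaluate

qrels = Qrels({"q_1": {"label_a": 1, "label_b": 1}})
run = Run({"q_1": {"label_a": 0.9, "label_c": 0.8, "label_b": 0.7}})

# Standard ranx call (real API):
evaluate(qrels, run, "precision@5")

# Hypothetical extension sketched in this thread (NOT part of ranx):
# per-label propensities passed as a simple mapping of weights.
# propensities = {"label_a": 0.9, "label_b": 0.1}
# evaluate(qrels, run, "psp@5", propensities=propensities)
```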
from ranx.
I guess that for PSP@k I need to change `hits += 1.0` to `hits += 1.0 / p_l`, where `p_l` is the user-provided propensity of the matched label:
```python
def _hits(qrels, run, k, rel_lvl):
    qrels = clean_qrels(qrels, rel_lvl)
    if len(qrels) == 0:
        return 0.0
    k = fix_k(k, run)
    max_true_id = np.max(qrels[:, 0])
    min_true_id = np.min(qrels[:, 0])
    hits = 0.0
    for i in range(k):
        # Skip retrieved ids that fall outside the range of relevant ids.
        if run[i, 0] > max_true_id:
            continue
        if run[i, 0] < min_true_id:
            continue
        # Count a hit if the retrieved id matches any relevant id.
        for j in range(qrels.shape[0]):
            if run[i, 0] == qrels[j, 0]:
                hits += 1.0
                break
    return hits
```
Is that right?
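For illustration, a sketch of that change as a standalone function (the name `_ps_hits`, the `inv_propensities` argument, and the simplified loop are assumptions, not ranx code):

```python
def _ps_hits(qrels, run, k, inv_propensities):
    # Hypothetical propensity-weighted variant of _hits (sketch only).
    # qrels and run are arrays with ids in column 0, as in the function
    # above; the min/max id shortcuts and rel_lvl filtering are omitted.
    hits = 0.0
    for i in range(min(k, run.shape[0])):
        for j in range(qrels.shape[0]):
            if run[i, 0] == qrels[j, 0]:
                # A relevant hit contributes 1 / p_l rather than a flat 1.0.
                hits += inv_propensities[int(run[i, 0])]
                break
    return hits
```

Note that the pyxclib implementation quoted earlier also divides by the propensity-scored precision of the best possible ranking (its ps_eval_flags term), so changing the increment alone would yield only the unnormalized numerator.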
from ranx.
I suggest you first provide some simple test cases for the propensity-based metrics.
I have no time to read the related papers or debug code in the wild.
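For instance, such a test case might look like the following (a pytest-style sketch against a minimal reference implementation; names and numbers are illustrative, expected values computed by hand):

```python
import pytest

def psp_at_k(ranked, relevant, p, k):
    # Minimal reference implementation of propensity-scored precision@k:
    # each relevant label in the top k is weighted by 1 / p_l.
    return sum(1.0 / p[l] for l in ranked[:k] if l in relevant) / k

def test_psp_upweights_tail_labels():
    p = {1: 1.0, 2: 0.5, 3: 0.25}
    # Hits: label 3 contributes 1/0.25 = 4, label 2 contributes 1/0.5 = 2.
    assert psp_at_k([3, 1, 2], relevant={2, 3}, p=p, k=3) == pytest.approx(2.0)

def test_psp_reduces_to_precision_with_unit_propensities():
    # With all propensities equal to 1, PSP@k is plain precision@k.
    p = {1: 1.0, 2: 1.0, 3: 1.0}
    assert psp_at_k([1, 2, 3], relevant={1, 3}, p=p, k=3) == pytest.approx(2 / 3)
```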
from ranx.
> I suggest you first provide some simple test cases for the propensity-based metrics. I have no time to read the related papers or debug code in the wild.
Hello @AmenRa, sorry for the late response. I've been working on the requested test cases, which took longer than expected.
Since this request was closed, I've provided the test cases and explanations in the new Feature Request.
from ranx.
Related Issues (20)
- [BUG] dcg and dcg_burges do not work in the compare function HOT 2
- [Feature Request] Use black to indent the code HOT 1
- [BUG] RBP with multiple relevance levels HOT 3
- [Feature Request] Support gzipped files? HOT 3
- [Feature Request] memory issue / make Run more efficient HOT 2
- Incorrect result for f1 score HOT 13
- Zero-scored documents HOT 10
- [BUG] Misleading exception message on dataframe types HOT 2
- [BUG] Issues when storing/loading Qrels from a dataframe and a parquet file. HOT 6
- [Feature Request] Run.from_df and Run.from_parquet does not allow specifying run name HOT 1
- Question on rank aggregation usage HOT 4
- Getting "Segmentation fault (core dumped)" error HOT 2
- [Feature Request] stddev statistic HOT 3
- Couldn't find any documentation about Qrel and run score range HOT 2
- [Feature Request] Propensity-scored Metrics HOT 1
- How do we compare different runs with multiple folds per run? HOT 1
- [Question] About the correction among multiple hypotheses HOT 1
- [Question] How to compute precision for a retriever operating at passage-level HOT 1
- JIT compilation on serverless (i.e. Modal Labs) HOT 1
- I'm getting NaN for the BPref measurement HOT 5