amenra / ranx
⚡️ A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍
Home Page: https://amenra.github.io/ranx
License: MIT License
I understood that, when evaluating MAP@k, relevance judgment scores equal to 0 are ignored.
In my case, I get somewhat weird behaviour.
I'm working on a balanced dataset with binary relevance and define qrels by including both documents scored 1 and documents scored 0.
While ndcg@10 gives me results of about 0.7, MAP@10 is extremely low, at about 0.10.
Can this be because the model performs poorly beyond the very first documents, or am I doing something wrong in the evaluation?
qrels = Qrels.from_df(
df=test_loaded_pdf,
q_id_col="user_id",
doc_id_col="run_session_id",
score_col="target_binary",
)
run = Run.from_df(
df=test_loaded_pdf,
q_id_col="user_id",
doc_id_col="run_session_id",
score_col="predictions",
)
evaluate(qrels, run, ["map@10", "mrr", "ndcg@10"])
Note that predictions in test_loaded_pdf is not a list of binary relevance labels; it's a list of float relevance scores.
Hi! This lib is an extremely good tool to have in one's arsenal, but I think it would be nice to have Spearman and Kendall correlation functions included in this package for evaluating rankings. Maybe not the most popular metrics, but sometimes they can come in handy.
Best regards,
Ivan Savchuk
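In the meantime, a minimal stopgap sketch using scipy.stats (kendalltau and spearmanr are existing scipy functions; aligning the two runs on shared document ids is just my assumption about the intended usage):
from scipy.stats import kendalltau, spearmanr

# Hypothetical scores from two systems over the same documents
run_a = {"d_1": 0.9, "d_2": 0.7, "d_3": 0.4, "d_4": 0.2}
run_b = {"d_1": 0.8, "d_2": 0.9, "d_3": 0.1, "d_4": 0.3}

docs = sorted(run_a)  # align both score lists on the same document ids
a = [run_a[d] for d in docs]
b = [run_b[d] for d in docs]

print(kendalltau(a, b))  # rank correlation and p-value
print(spearmanr(a, b))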
Hi,
It’d be nice to be able to save a Report’s comparisons as a JSON file.
However, since it uses frozensets as keys, it is not JSON serializable.
Maybe you could add a method in https://github.com/AmenRa/ranx/blob/master/ranx/frozenset_dict.py to convert the _map to a JSON-serializable dict, i.e. with str keys?
The str keys could be converted from the frozenset like: ', '.join(frozenset({'foo', 'bar'}))
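A sketch of that conversion (sorting the frozenset members for deterministic keys is my own addition):
def to_json_safe(_map):
    # frozenset keys -> comma-joined, sorted string keys
    return {", ".join(sorted(k)): v for k, v in _map.items()}

# e.g. {frozenset({"foo", "bar"}): 1} -> {"bar, foo": 1}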
Describe the bug
Given the formula of RBP,
\operatorname{RBP} = (1 - p) \cdot \sum_{i=1}^{d}{r_i \cdot p^{i - 1}}
where r_i is the reward/utility, RBP should support multiple relevance levels, similar to DCG: if the max relevance level is 2, the max RBP value should be 2.
It would be good to additionally report the sample variance (or standard deviation) of the per-query metric scores alongside their average.
Like ndcg@50: 11, stddev: 2.8
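In the meantime, a minimal sketch of how to get this today (assuming evaluate's return_mean parameter returns the per-query scores when set to False, as suggested by the evaluate signature quoted later on this page):
import numpy as np
from ranx import Qrels, Run, evaluate

qrels = Qrels({"q_1": {"d_1": 1}, "q_2": {"d_2": 1}})
run = Run({"q_1": {"d_1": 0.9, "d_3": 0.8}, "q_2": {"d_3": 0.9, "d_2": 0.8}})

scores = evaluate(qrels, run, "ndcg@50", return_mean=False)  # per-query scores
print(np.mean(scores), np.std(scores, ddof=1))  # mean and sample stddev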
Is your feature request related to a problem? Please describe.
First of all: This is a really nice library! It helps a lot!
My request is regarding a recall-precision graphic. When I read TREC-related papers, they very often use the interpolated precision-recall plot to visualize the performance of IR systems being compared to each other. They also use the graphs to understand which IR system yields a higher recall value, shows a higher precision value, and generally performs better.
Describe the solution you'd like
Since I love using this library, it would be great if there were a function for generating such a precision-recall plot, or just an example/notebook in the documentation for generating this plot using the ranx library.
Describe alternatives you've considered
I've already considered creating these plots myself with the functions that the library offers. However, it seems quite complex as the interpolation greatly complicates this work for me.
Additional context
Here is an example interpolated recall-precision plot:
More information can be found on:
https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html
and
https://trec.nist.gov/pubs/trec15/appendices/CE.MEASURES06.pdf (p.4)
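For reference, a minimal sketch of the standard 11-point interpolated precision-recall curve for a single ranked list (plain numpy/matplotlib, not ranx internals; the binary relevance vector is made up):
import numpy as np
import matplotlib.pyplot as plt

rels = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])  # hypothetical ranked relevance
tp = np.cumsum(rels)
precision = tp / np.arange(1, len(rels) + 1)
recall = tp / rels.sum()

levels = np.linspace(0, 1, 11)
# Interpolated precision at recall r: max precision at any recall >= r
interp = [precision[recall >= r].max() if (recall >= r).any() else 0.0 for r in levels]

plt.plot(levels, interp, drawstyle="steps-post")
plt.xlabel("Recall")
plt.ylabel("Interpolated precision")
plt.show()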
Is your feature request related to a problem? Please describe.
I think it’s quite frustrating to have to specify the format of qrels/run (TREC or JSON). I often get exceptions when I forget to specify the 'trec' format, because JSON is the default.
Describe the solution you'd like
You could infer the format from the file extension: if .json, then JSON; else TREC. I can open a PR if you’re interested :)
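The inference itself is a one-liner; a sketch of what such a helper might look like (infer_kind is a hypothetical name, not an existing ranx function):
from pathlib import Path

def infer_kind(path: str) -> str:
    # ".json" -> JSON, anything else -> TREC, as proposed above
    return "json" if Path(path).suffix == ".json" else "trec"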
I want to implement propensity-scored precision at k (PSP@k), as defined in [1].
How could I integrate this metric in ranx?
References:
[1] Zhang, J., Chang, W.C., Yu, H.F. and Dhillon, I., 2021. Fast multi-resolution transformer fine-tuning for extreme multi-label text classification. Advances in Neural Information Processing Systems, 34, pp.7267-7280.
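For reference, a standalone sketch of PSP@k following the propensity-scored precision used in the extreme multi-label literature (the function and its arguments are mine, not a ranx API; plugging it into ranx's numba-compiled metrics would require following the library's internal metric layout):
def psp_at_k(ranked_doc_ids, relevant, propensity, k):
    # relevant: set of relevant doc ids; propensity: doc_id -> p_l in (0, 1]
    # PSP@k = (1 / k) * sum over the top-k results of y_l / p_l
    return sum(1.0 / propensity[d] for d in ranked_doc_ids[:k] if d in relevant) / k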
Why is the relevance score mandatory in Qrels?
I don’t see where you use it to compute the metrics https://github.com/AmenRa/ranx/blob/master/ranx/metrics.py
Is there any way to make it optional? If not, would a filler value like 0 for all documents be appropriate?
Describe the bug
Hi, I’m having an error when using compare with stat_test="student" (no problem when using the default "fisher").
TypeError Traceback (most recent call last)
<ipython-input-3-0369b81922de> in <module>
7 metrics=["map@100", "mrr@100", "ndcg@10"],
8 stat_test="student",
----> 9 max_p=0.01 # P-value threshold
10 )
/gpfsdswork/projects/rech/fih/usl47jg/ranx/ranx/meta/compare.py in compare(qrels, runs, metrics, stat_test, n_permutations, max_p, random_seed, threads, rounding_digits, show_percentages)
100 n_permutations=n_permutations,
101 max_p=max_p,
--> 102 random_seed=random_seed,
103 )
104
/gpfsdswork/projects/rech/fih/usl47jg/ranx/ranx/statistical_tests/__init__.py in compute_statistical_significance(model_names, metric_scores, stat_test, n_permutations, max_p, random_seed)
81 n_permutations,
82 max_p,
---> 83 random_seed,
84 )
85
/gpfsdswork/projects/rech/fih/usl47jg/ranx/ranx/statistical_tests/__init__.py in _compute_statistical_significance(control_metric_scores, treatment_metric_scores, stat_test, n_permutations, max_p, random_seed)
38 elif stat_test == "student":
39 p_value, significant = paired_student_t_test(
---> 40 control_metric_scores[m], treatment_metric_scores[m], max_p,
41 )
42
/gpfsdswork/projects/rech/fih/usl47jg/ranx/ranx/statistical_tests/paired_student_t_test.py in paired_student_t_test(control, treatment, max_p)
11
12 """
---> 13 _, p_value = ttest_rel(control, treatment, alternative="two-sided")
14
15 return p_value, p_value <= max_p
TypeError: ttest_rel() got an unexpected keyword argument 'alternative'
To Reproduce
In [1]: from ranx import Qrels, Run
...:
...: qrels_dict = { "q_1": { "d_12": 5, "d_25": 3 },
...: "q_2": { "d_11": 6, "d_22": 1 } }
...:
...: run_dict = { "q_1": { "d_12": 0.9, "d_23": 0.8, "d_25": 0.7,
...: "d_36": 0.6, "d_32": 0.5, "d_35": 0.4 },
...: "q_2": { "d_12": 0.9, "d_11": 0.8, "d_25": 0.7,
...: "d_36": 0.6, "d_22": 0.5, "d_35": 0.4 } }
...:
...: qrels = Qrels(qrels_dict)
...: run = Run(run_dict)
In [2]: from ranx import compare
...:
...: # Compare different runs and perform statistical tests
...: report = compare(
...: qrels=qrels,
...: runs=[run, run],
...: metrics=["map@100", "mrr@100", "ndcg@10"],
...: stat_test="student",
...: max_p=0.01 # P-value threshold
...: )
Versions
ranx==0.2.8
@mponza found a bug when computing recall and promised to document it next week; adding a reminder here for him. Something related to multiple documents having score zero.
Thanks for your amazing work. I am very interested in this framework and am trying to use it to solve my rank aggregation problems. However, I am a little confused about the usage.
For example, I have scores for several items under different ranking rules, as follows:
item | rank1 | rank2 | rank3
item1 | 0.4 | 0.8 | 0.2
item2 | 0.8 | 0.7 | 0.7
item3 | 0.7 | 0.7 | 1.0
item4 | 0.5 | 0.6 | 0.7
Now I want a comprehensive ranking that satisfies all the subrank (e.g., rank1, rank2, rank3) constraints as much as possible:
item | rank
item1 | 0.5
item2 | 0.9
item3 | 0.7
item4 | 0.4
I'm not sure if that can be addressed with ranx. If so, could you show me an example?
Thanks a lot.
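If I understand the fusion API correctly, something like the following sketch might work (the norm/method names follow the fuse examples elsewhere on this page, and treating the whole aggregation as a single pseudo-query is my own assumption):
from ranx import Run, fuse

# One Run per ranking rule, all under a single pseudo-query "q_1"
run_1 = Run({"q_1": {"item1": 0.4, "item2": 0.8, "item3": 0.7, "item4": 0.5}})
run_2 = Run({"q_1": {"item1": 0.8, "item2": 0.7, "item3": 0.7, "item4": 0.6}})
run_3 = Run({"q_1": {"item1": 0.2, "item2": 0.7, "item3": 1.0, "item4": 0.7}})

combined = fuse(runs=[run_1, run_2, run_3], norm="min-max", method="sum")
# combined should then hold one aggregated score per item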
Hi,
@osf9018 mentioned it in #2, but I guess it’s better to create a specific issue.
It is often difficult to estimate the total number of relevant documents for a query.
For example, in Question Answering, if you have a large enough Knowledge Base, you can find the answer to your question in a surprisingly large number of documents that one cannot annotate in advance. Because of this, the relevance of a document is often estimated on the go, by checking whether the answer string is in the document retrieved by the system.
Because of this, recall is not an appropriate metric. However, one way to circumvent this is to compute recall "as if" there were only a single relevant document. After averaging over the whole dataset, it corresponds to the proportion of questions for which the system retrieved at least one relevant document in the top K. This is what @osf9018 and I call "hits@K" (I can’t remember where, but I’ve seen it in a paper), and what others, such as Karpukhin et al., call "accuracy". Accuracy is a confusing term IMO.
Would you be interested in implementing or integrating this feature in your library?
It might take some renaming, but it could be implemented very easily by using the _hits function: it is simply min(1, _hits(qrels, run, k)).
I was wondering whether ranx has a feature similar to trec_eval’s -l parameter for specifying the relevance level. If not, that would be a useful feature for evaluation.
Hi,
Thank you for this nice library.
Is there a fundamental reason why relevance scores need to be integers in Qrels?
Thanks.
Attempting to compute the MRR with your example, and I am getting a TypingError.
System:
from rank_eval import ndcg, mrr
import numpy as np
# Note that y_true does not need to be ordered
# Integers are documents IDs, while floats are the true relevance scores
y_true = np.array([[[12, 0.5], [25, 0.3]], [[11, 0.4], [2, 0.6]]])
y_pred = np.array(
[
[[12, 0.9], [234, 0.8], [25, 0.7], [36, 0.6], [32, 0.5], [35, 0.4]],
[[12, 0.9], [11, 0.8], [25, 0.7], [36, 0.6], [2, 0.5], [35, 0.4]],
]
)
k = 5
mrr(y_true, y_pred, k, threads=1)
Error message
---------------------------------------------------------------------------
TypingError Traceback (most recent call last)
<ipython-input-24-7d69c81d4a8f> in <module>
13 k = 5
14
---> 15 mrr(y_true, y_pred, k, threads=1)
~/anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py in mrr(y_true, y_pred, k, return_mean, sort, threads)
482 """
483
--> 484 return _choose_optimal_function(
485 y_true=y_true,
486 y_pred=y_pred,
~/anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py in _choose_optimal_function(y_true, y_pred, f_name, f_single, f_parallel, f_additional_args, return_mean, sort, threads)
234 if sort:
235 y_pred = _descending_sort_parallel(y_pred)
--> 236 scores = f_parallel(y_true, y_pred, **f_additional_args)
237 if return_mean:
238 return np.mean(scores)
~/anaconda3/envs/deeplearning/lib/python3.8/site-packages/numba/core/dispatcher.py in _compile_for_args(self, *args, **kws)
399 e.patch_message(msg)
400
--> 401 error_rewrite(e, 'typing')
402 except errors.UnsupportedError as e:
403 # Something unsupported is present in the user code, add help info
~/anaconda3/envs/deeplearning/lib/python3.8/site-packages/numba/core/dispatcher.py in error_rewrite(e, issue_type)
342 raise e
343 else:
--> 344 reraise(type(e), e, None)
345
346 argtypes = []
~/anaconda3/envs/deeplearning/lib/python3.8/site-packages/numba/core/utils.py in reraise(tp, value, tb)
77 value = tp()
78 if value.__traceback__ is not tb:
---> 79 raise value.with_traceback(tb)
80 raise value
81
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Failed in nopython mode pipeline (step: nopython frontend)
Failed in nopython mode pipeline (step: nopython frontend)
Invalid use of Function(<built-in function contains>) with argument(s) of type(s): (array(float64, 1d, A), float64)
* parameterized
In definition 0:
All templates rejected with literals.
In definition 1:
All templates rejected without literals.
In definition 2:
All templates rejected with literals.
In definition 3:
All templates rejected without literals.
In definition 4:
All templates rejected with literals.
In definition 5:
All templates rejected without literals.
In definition 6:
All templates rejected with literals.
In definition 7:
All templates rejected without literals.
In definition 8:
All templates rejected with literals.
In definition 9:
All templates rejected without literals.
In definition 10:
All templates rejected with literals.
In definition 11:
All templates rejected without literals.
In definition 12:
All templates rejected with literals.
In definition 13:
All templates rejected without literals.
This error is usually caused by passing an argument of a type that is unsupported by the named function.
[1] During: typing of intrinsic-call at /home/doc/anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py (78)
File "../../../anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py", line 78:
def _reciprocal_rank(y_true, y_pred, k):
<source elided>
for i in range(k):
if y_pred[i, 0] in y_true[:, 0]:
^
[1] During: resolving callee type: type(CPUDispatcher(<function _reciprocal_rank at 0x7f6697ed9dc0>))
[2] During: typing of call at /home/doc/anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py (11)
[3] During: resolving callee type: type(CPUDispatcher(<function _reciprocal_rank at 0x7f6697ed9dc0>))
[4] During: typing of call at /home/doc/anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py (11)
File "../../../anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py", line 11:
def _parallelize(f, y_true, y_pred, k):
<source elided>
for i in prange(len(y_true)):
scores[i] = f(y_true[i], y_pred[i], k)
^
[1] During: resolving callee type: type(CPUDispatcher(<function _parallelize at 0x7f6697ed4e50>))
[2] During: typing of call at /home/doc/anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py (85)
[3] During: resolving callee type: type(CPUDispatcher(<function _parallelize at 0x7f6697ed4e50>))
[4] During: typing of call at /home/doc/anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py (85)
File "../../../anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py", line 85:
def _mrr(y_true, y_pred, k):
return _parallelize(_reciprocal_rank, y_true, y_pred, k)
^
Describe the bug
I am evaluating my test set against my algorithm recommendations, but it gives the error
Qrels and Run query ids do not match
To Reproduce
Steps to reproduce the behavior:
metrics = ["recall@10", "mrr@10","ndcg@10"]
person_date_indexs = df_recs_train_top10['person_date_index'].unique()
run = tranform_recs_to_ranx(df_recs_train_top10, person_date_indexs, "person_date_index", "gym_id", "rank")
qrels = tranform_test_to_ranx(df_test_checkins_new_col_renamed, df_recs_train_top10)
evaluate(qrels, run, metrics)
Expected behavior
print the recall, mrr and ndcg
Desktop (please complete the following information):
Ubuntu 22.04
Additional context
It worked fine with another test set. The only difference is that in this test set I removed items that the user had already interacted with in train.
Describe the bug
Qrels.from_ir_datasets("msmarco-document/dev") does not seem to work; it appears to have been removed by a commit 10 days ago (e0ca82c).
Hi Elias,
Is your feature request related to a problem? Please describe.
I've noticed that Run (and I guess also Qrels) consumes a lot of memory (RAM) compared to a standard Python dict, e.g. a few GB instead of a few hundred MB. This gets problematic for somewhat large datasets (e.g. 1M queries).
Describe the solution you'd like
I guess it's related to Numba representation? I've no clue on how to make it more efficient, sorry :)
Reproduce
Just open your system monitor and see how the memory grows.
In [1]: import ranx
# this weighs only a few 100s of MB
In [2]: run_d = {str(i): {str(j): 0.0 for j in range(100)} for i in range(100000)}
# this grows to a few GB
In [3]: run_r = ranx.Run(run_d)
Best,
Paul
It’s all in the title :)
Line 140 in f41e64d
Is your feature request related to a problem? Please describe.
Instead of manually trying every possible fusion technique, it’d be nice to be able to grid-search all of them, as optimize_fusion already does for fusion hyperparameters (e.g. weights in wsum).
Describe the solution you'd like
Allow passing List[str] instead of str for the norm and method parameters.
Then do something like:
for norm in norms:
    for method in methods:
        optimize_fusion_current_behavior(*args, norm=norm, method=method, **kwargs)
The trick is to report the results in a readable manner.
Hello,
Thank you for your amazing work. I am trying to use supervised fusion methods like this:
best_params = optimize_fusion(
qrels=qrels,
runs=[run_4, run_5],
norm='min-max', # Default value
method='posfuse',
metric="mrr@10",
)
combined_run = fuse(
runs=[run_4, run_5],
norm='min-max', # Default value
method='posfuse',
params=best_params,
)
I tried changing the norm and metric to every possible value, but I still get a "Segmentation fault (core dumped)" error.
I could not find any hints in the documentation about this. Sorry if I am missing something, but can you help me with using these fusion methods?
thanks,
Describe the bug
Bug when reconstructing Qrels from a pandas DataFrame. This bug also affects reading a Qrels from a Parquet file, as the pandas-to-Qrels conversion is used internally.
Pandas version: 1.5.2
Ranx: latest pip version
To Reproduce
Code:
from ranx import Qrels
qrels = Qrels({'1':{'1':1}, '2':{'2': 1}})
df = qrels.to_dataframe()
Qrels.from_df(df)
Output:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "...\lib\site-packages\ranx\data_structures\qrels.py", line 300, in from_df
assert df[score_col].dtype == int, "DataFrame scores column dtype must be `int`"
AssertionError: DataFrame scores column dtype must be `int`
About the DataFrame dtypes
It is using int64 instead of int:
>>> df.dtypes
q_id object
doc_id object
score int64
dtype: object
>>> import numpy as np
>>> df.dtypes['score'] == np.int64
True
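A possible library-side fix, as a sketch (np.issubdtype is a real numpy call; the assertion text mirrors the one in qrels.py), would be to accept any integer dtype instead of comparing against the platform-dependent built-in int:
import numpy as np

# df is the DataFrame produced by qrels.to_dataframe() above
assert np.issubdtype(df["score"].dtype, np.integer), (
    "DataFrame scores column dtype must be an integer type"
)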
Hi,
Thank you for this nice library.
How would you go about implementing Reciprocal Rank Fusion in ranx?
Thanks,
Maxime.
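For context, what I imagine it might look like, assuming "rrf" is among the method names accepted by fuse (the call shape follows the fuse examples elsewhere on this page):
from ranx import Run, fuse

run_1 = Run({"q_1": {"d_1": 0.9, "d_2": 0.8}})  # hypothetical runs
run_2 = Run({"q_1": {"d_2": 0.9, "d_1": 0.7}})

combined_run = fuse(runs=[run_1, run_2], method="rrf")  # Reciprocal Rank Fusion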
Is your feature request related to a problem? Please describe.
I'm using Pandas dataframes and comparing different embedding models. I'd like to be able to name my runs so that, when I compare them, the report shows something other than run_1, run_2.
Describe the solution you'd like
Allow name as a named parameter in Run.from_df.
Describe alternatives you've considered
You can just set the name afterward, e.g.
my_run.name = "My Run"
Additional context
I guess this is just for consistency, since Run.from_file allows you to pass a name.
Describe the bug
I'm using the library for the first time with a Pandas dataframe and ran into an exception that was misleading.
To Reproduce
Steps to reproduce the behavior:
The id column is of type int64, e.g. df['id'] = df.index + 1
qrels = Qrels.from_df(
df=df,
q_id_col="id",
doc_id_col="best_document",
score_col="score",
)
/usr/local/lib/python3.10/dist-packages/ranx/data_structures/qrels.py in from_df(df, q_id_col, doc_id_col, score_col)
293 """
294 assert (
--> 295 df[q_id_col].dtype == "O"
296 ), "DataFrame scores column dtype must be `object` (string)"
297 assert (
AssertionError: DataFrame scores column dtype must be `object` (string)
Expected behavior
The assertion message should point to the correct column; in this case, it is the ID column that is of the wrong type. From inspecting the code, the assertion message is also wrong when the document ID column is of the wrong type.
Describe the bug
MRR@1 should be equal to Recall@1. However, these metrics diverge for the case below.
To Reproduce
%%capture
!pip install ranx
from ranx import Qrels, Run, evaluate
import pickle
# download files from https://drive.google.com/drive/folders/1ZLyB6mKKiQsypw36nhdZ4dGqmFw27K-3?usp=sharing
with open("qrels.pkl", "rb") as f:
qrels = pickle.load(f)
with open("run.pkl", "rb") as f:
run = pickle.load(f)
evaluate(
Qrels(qrels),
Run(run),
['mrr@1', 'mrr@5', 'mrr@10', 'recall@1', 'recall@5', 'recall@10'])
# {'mrr@1': 0.8133879123525163,
# 'mrr@5': 0.820242395055783,
# 'mrr@10': 0.8206332007078454,
# 'recall@1': 0.04814167526511499,
# 'recall@5': 0.05089848464127321,
# 'recall@10': 0.05171913427724859}
or use Google Colab.
Expected behavior
mrr@1 = recall@1
Am I missing something?
from ranx import Qrels
from ranx import Run
from ranx import evaluate
qrels_dict = {
"text_1": {
"label_1": 1
},
"text_2":{
"label_2": 1,
}
}
qrels = Qrels(qrels_dict, name="testing")
run_dict = {
"text_1": {
"label_1": 1,
"label_2": 0.9,
"label_3": 0.8,
"label_4": 0.7,
"label_5": 0.6,
"label_6": 0.5,
"label_7": 0.4,
"label_8": 0.3,
"label_9": 0.2,
"label_10": 0.1,
},
"text_2": {
"label_1": 0.9,
"label_2": 1,
"label_3": 0.8,
"label_4": 0.7,
"label_5": 0.6,
"label_6": 0.5,
"label_7": 0.4,
"label_8": 0.3,
"label_9": 0.2,
"label_10": 0.1,
},
}
run = Run(run_dict, name="bm25")
evaluate(qrels, run, ["mrr@1", "mrr@5", "mrr@10"])
CPU times: user 55.2 s, sys: 467 ms, total: 55.7 s
Wall time: 56.8 s
This behavior can be reproduced in this Google Colab Notebook 1 and also in this Google Colab Notebook 2 (time spent per step).
Using f1 or _f1_parallel for all qrels and run gives incorrect output, but if I use _f1 on an individual query, it gives the correct F1 score.
The two calls below return 0 for 4 cases. Ideally it should be 0 for only 1 of the 18 cases in the _qrels and _run passed.
from ranx.metrics.f1 import _f1_parallel, _f1, f1
f1(_qrels, _run, 1, 1)
# output
# array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0.])
_f1_parallel(_qrels, _run, 1 , 1)
# output
# array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0.])
But if I call _f1 for each item in qrels, I get correct F1 score for all queries.
scores = []
for i in range(len(_qrels)):  # plain range (the original snippet used numba's prange)
    try:
        scores.append(_f1(_qrels[i], _run[i], 1, 1))
    except Exception as error:
        # handle the exception
        print(f" {i} An exception occurred:", error)
        scores.append(0)
        continue
# output
# [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0, 1.0, 1.0, 1.0]
Notice, I get only 1 case where the F1 score is 0, which I think is expected for my case.
More context -
I am using the f1 metric at k=1 for a dataset where each query has only 1 relevant document. There are 18 unique qrels.
Since each query has only 1 relevant document, the score for mrr@1 = recall@1 = 0.9444; precision@1 = 0.944 as well in my case.
Here's the output of my evaluation -
{'mrr@1': 0.9444444444444444, 'mrr@2': 0.9444444444444444, 'recall@1': 0.9444444444444444, 'recall@2': 0.9444444444444444, 'precision@1': 0.9444444444444444, 'f1@1': 0.7777777777777778}
f1@1 seems very low considering that precision and recall @1 are equal and high.
Here's the pretty-printed output of scores for each individual query.
I've manually validated the output for all the metrics, and all of them seem correct to me except for f1@1.
Notice that the hits score is 0 only for q_id '6', so an f1 score of 0 for q_id '6' is expected, but the f1 score is also 0 for q_ids '7', '8' and '9'.
{
"mrr": {
"0": 1.0,
"1": 1.0,
"10": 1.0,
"11": 1.0,
"12": 1.0,
"13": 1.0,
"14": 1.0,
"15": 1.0,
"16": 1.0,
"17": 1.0,
"2": 1.0,
"3": 1.0,
"4": 1.0,
"5": 1.0,
"6": 0.3333333333333333,
"7": 1.0,
"8": 1.0,
"9": 1.0
},
"mrr@1": {
"0": 1.0,
"1": 1.0,
"10": 1.0,
"11": 1.0,
"12": 1.0,
"13": 1.0,
"14": 1.0,
"15": 1.0,
"16": 1.0,
"17": 1.0,
"2": 1.0,
"3": 1.0,
"4": 1.0,
"5": 1.0,
"6": 0.0,
"7": 1.0,
"8": 1.0,
"9": 1.0
},
"recall@1": {
"0": 1.0,
"1": 1.0,
"10": 1.0,
"11": 1.0,
"12": 1.0,
"13": 1.0,
"14": 1.0,
"15": 1.0,
"16": 1.0,
"17": 1.0,
"2": 1.0,
"3": 1.0,
"4": 1.0,
"5": 1.0,
"6": 0.0,
"7": 1.0,
"8": 1.0,
"9": 1.0
},
"precision@1": {
"0": 1.0,
"1": 1.0,
"10": 1.0,
"11": 1.0,
"12": 1.0,
"13": 1.0,
"14": 1.0,
"15": 1.0,
"16": 1.0,
"17": 1.0,
"2": 1.0,
"3": 1.0,
"4": 1.0,
"5": 1.0,
"6": 0.0,
"7": 1.0,
"8": 1.0,
"9": 1.0
},
"f1@1": {
"0": 1.0,
"1": 1.0,
"10": 1.0,
"11": 1.0,
"12": 1.0,
"13": 1.0,
"14": 1.0,
"15": 1.0,
"16": 1.0,
"17": 1.0,
"2": 1.0,
"3": 1.0,
"4": 1.0,
"5": 1.0,
"6": 0.0,
"7": 0.0,
"8": 0.0,
"9": 0.0
},
"hits@1": {
"0": 1.0,
"1": 1.0,
"10": 1.0,
"11": 1.0,
"12": 1.0,
"13": 1.0,
"14": 1.0,
"15": 1.0,
"16": 1.0,
"17": 1.0,
"2": 1.0,
"3": 1.0,
"4": 1.0,
"5": 1.0,
"6": 0.0,
"7": 1.0,
"8": 1.0,
"9": 1.0
}
}
Thanks in advance.
Let me know if I am missing something.
Missing results cause AssertionError: Qrels and Run query ids do not match
You should make check_keys optional in evaluate, because sometimes queries do not return any results for lexical-based systems.
Line 128 in da0aa52
Is your feature request related to a problem? Please describe.
trec files can be several megabytes; for example, the run_*.trec files used for the examples are all more than 20 MB, but once compressed they become less than 10 MB. That would make downloads faster and also speed up loading the files into memory.
Describe the solution you'd like
Support *.trec.gz.
Describe alternatives you've considered
It would be cool to also evaluate alternative formats for storing the trec file, like [Parquet](https://arrow.apache.org/docs/python/parquet.html). This library focuses on computing metrics fast, but if you spend ages loading/parsing the trec file, that speed is not very useful; Parquet is much faster to load into memory and supports compression natively.
Additional context
💨
Hi,
compare does not have a rounding_digits argument and thus always uses the default from Report (which is 4). Why is that?
Also, would you like to add an option in Report to print results as percentages rather than ratios?
Hi,
I tested your code and found it easy to use and integrate. Moreover, the results I got are fully coherent with those I previously obtained with a personal implementation of trec_eval, and the computation of the measures is fast. This is clearly an interesting piece of software, and its presentation at the demo session of ECIR 2022 is a good thing.
I had only one problem, with the R-precision measure. The main issue is that if you replace "ndcg@5" with "r-precision" in the 4th cell of the overview.ipynb notebook, you get:
TypeError Traceback (most recent call last)
/tmp/ipykernel_28676/2318072837.py in
1 # Compute NDCG@5
----> 2 evaluate(qrels, run, "r-precision")
/vol/data/ferret/tools-distrib/_research_code/rank_eval/rank_eval/meta_functions.py in evaluate(qrels, run, metrics, return_mean, threads, save_results_in_run)
149 for m, scores in metric_scores_dict.items():
150 for i, q_id in enumerate(run.get_query_ids()):
--> 151 run.scores[m][q_id] = scores[i]
152 # Prepare output -----------------------------------------------------------
153 if return_mean:
TypeError: 'numpy.float64' object does not support item assignment
I first detected the problem through the integration of your code and obtained the same error. Looking at the file meta_functions.py, where the problem arises:
143 if type(run) == Run and save_results_in_run:
144 for m, scores in metric_scores_dict.items():
145 if m not in ["r_precision", "r-precision"]:
146 run.mean_scores[m] = np.mean(scores)
147 else:
148 run.scores[m] = np.mean(scores)
149 for m, scores in metric_scores_dict.items():
150 for i, q_id in enumerate(run.get_query_ids()):
151 run.scores[m][q_id] = scores[i]
I saw your recent update of this part of the code, but there is still a problem: for R-precision, the mean of the scores is stored in run.scores and not in run.mean_scores. As a consequence, using run.scores for storing the score of each query raises a problem if both the return_mean and save_results_in_run flags are set to True. More globally, I am not sure I understand why you treat R-precision differently from the other measures when computing the mean score.
Thank you in advance for your efforts in fixing the issue.
Olivier
https://github.com/AmenRa/rank_eval/blob/master/rank_eval/meta_functions.py#L220
control_metric_scores could be defined outside the for j loop :)
Describe the bug
When converting a Report to a dict, you only keep one metric while iterating over metrics (overwriting the previous metric in each loop):
https://github.com/AmenRa/ranx/blob/master/ranx/report.py#L315
How to fix
Replace the line above with:
d[m1]["win_tie_loss"][m2][metric] = self.win_tie_loss[(m1, m2)][metric]
and initialize d[m1]["win_tie_loss"][m2] = {} at the same place as https://github.com/AmenRa/ranx/blob/master/ranx/report.py#L309 (just above).
First of all, great work on this code. I have been looking for a definitive package to evaluate ranking models and I believe this is that package.
My question is perhaps a bit outside the domain, but it could help others in the future. How would you deal with comparing two models where each has multiple runs (e.g., runs with different random initializations and/or batch shuffling, for confidence intervals)? I was thinking that perhaps the significance testing could be performed between the mean (across runs) metric_scores vectors.
Thanks in advance,
Tiago
Is your feature request related to a problem? Please describe.
Hi, you’ve done a great job implementing plenty of different fusion algorithms, but I think a fixed built-in set will always be a bottleneck.
What would you think about letting the user define their own training function?
Describe the solution you'd like
For example, in optimize_fusion, allow method to be a callable and, in this case, do not call has_hyperparams and optimization_switch.
Describe alternatives you've considered
My use case / Ma et al.
By the way, at the moment, my use case is the default-minimum trick of Ma et al.: when combining results from systems A and B, it consists of giving a document the minimum score of A's results if it was only retrieved by system B, and vice versa.
Maybe this is already possible in ranx via some option/method named differently? Or maybe you’d like to add it to the core ranx fusion algorithms?
Hi,
Just a quick question: I wondered what motivated the change of the stat test's default value to student instead of fisher in 0dc8d9c (I almost published results as-is before figuring it out 😅).
I thought one of your documentation pages pointed me to the paper by Smucker et al. that suggests using Fisher (and especially not Student), but maybe I don’t recall correctly.
Btw, the docstring still shows 'fisher' as the default: https://amenra.github.io/ranx/compare/
Hello,
I'd like to determine what query is causing the following error and how to get around it:
Traceback (most recent call last):
File "main.py", line 43, in perform_tasks
eval(params)
File "main.py", line 25, in eval
eval_helper.perform_eval()
File "/home/celso/projects/XMTC-Baselines/source/helper/EvalHelper.py", line 62, in perform_eval
qrels = Qrels(filtered_relevance_map)
File "/home/celso/projects/venvs/XMTC-Baselines/lib/python3.8/site-packages/ranx/data_structures/qrels.py", line 62, in __init__
max_len = max(len(y) for x in doc_ids for y in x)
ValueError: max() arg is an empty sequence
My evaluation code is shown in the code snippet below.
ranking = self._retrieve(...)
filtered_relevance_map = {key: value for key, value in self.relevance_map.items() if key in ranking.keys()}
qrels = Qrels(filtered_relevance_map)
run = Run(ranking, name=cls)
result = evaluate(qrels, run, self.metrics, threads=12)
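The failing line, max(len(y) for x in doc_ids for y in x), suggests that the filtered qrels end up with no (query, document) pairs at all. A plain-Python diagnostic sketch, continuing the snippet above, to list and drop queries without judged documents:
empty = [q for q, docs in filtered_relevance_map.items() if not docs]
print(empty)  # queries with no judged documents

filtered_relevance_map = {q: docs for q, docs in filtered_relevance_map.items() if docs}
qrels = Qrels(filtered_relevance_map)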
In industry, DCG (in both formulations) is a standard and widely used metric. I see it is already implemented as part of NDCG. Would it be possible to expose it to users as a real metric?
Describe the bug
Error during pip install.
To Reproduce
pip install ranx==0.3.2
Bash output
(xCoFormer) celso@capri:~/projects/xCoFormer$ pip install ranx==0.3.2
Collecting ranx==0.3.2
Using cached ranx-0.3.2-py3-none-any.whl (93 kB)
Requirement already satisfied: numpy in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ranx==0.3.2) (1.22.4)
Requirement already satisfied: tqdm in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ranx==0.3.2) (4.64.1)
Requirement already satisfied: scipy>=1.6.0 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ranx==0.3.2) (1.9.3)
Collecting lz4
Using cached lz4-4.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting cbor2
Using cached cbor2-5.4.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (224 kB)
Collecting ir-datasets
Using cached ir_datasets-0.5.4-py3-none-any.whl (311 kB)
Collecting statsmodels
Using cached statsmodels-0.13.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.0 MB)
Requirement already satisfied: pandas in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ranx==0.3.2) (1.4.4)
Requirement already satisfied: rich in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ranx==0.3.2) (12.6.0)
Collecting orjson
Using cached orjson-3.8.0-cp310-cp310-manylinux_2_28_x86_64.whl (146 kB)
Requirement already satisfied: tabulate in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ranx==0.3.2) (0.9.0)
Collecting numba>=0.54.1
Using cached numba-0.56.3-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (3.5 MB)
Collecting llvmlite<0.40,>=0.39.0dev0
Using cached llvmlite-0.39.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.6 MB)
Requirement already satisfied: setuptools in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from numba>=0.54.1->ranx==0.3.2) (59.6.0)
Requirement already satisfied: zlib-state>=0.1.3 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ir-datasets->ranx==0.3.2) (0.1.5)
Collecting lxml>=4.5.2
Using cached lxml-4.9.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (6.9 MB)
Requirement already satisfied: ijson>=3.1.3 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ir-datasets->ranx==0.3.2) (3.1.4)
Requirement already satisfied: trec-car-tools>=2.5.4 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ir-datasets->ranx==0.3.2) (2.6)
Requirement already satisfied: requests>=2.22.0 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ir-datasets->ranx==0.3.2) (2.28.1)
Requirement already satisfied: pyyaml>=5.3.1 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ir-datasets->ranx==0.3.2) (6.0)
Collecting pyautocorpus>=0.1.1
Using cached pyautocorpus-0.1.8.tar.gz (10 kB)
Preparing metadata (setup.py) ... done
Requirement already satisfied: unlzw3>=0.2.1 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ir-datasets->ranx==0.3.2) (0.2.1)
Collecting beautifulsoup4>=4.4.1
Using cached beautifulsoup4-4.11.1-py3-none-any.whl (128 kB)
Requirement already satisfied: warc3-wet>=0.2.3 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ir-datasets->ranx==0.3.2) (0.2.3)
Requirement already satisfied: warc3-wet-clueweb09>=0.2.5 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ir-datasets->ranx==0.3.2) (0.2.5)
Requirement already satisfied: python-dateutil>=2.8.1 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from pandas->ranx==0.3.2) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from pandas->ranx==0.3.2) (2022.5)
Requirement already satisfied: pygments<3.0.0,>=2.6.0 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from rich->ranx==0.3.2) (2.13.0)
Requirement already satisfied: commonmark<0.10.0,>=0.9.0 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from rich->ranx==0.3.2) (0.9.1)
Collecting patsy>=0.5.2
Using cached patsy-0.5.3-py2.py3-none-any.whl (233 kB)
Requirement already satisfied: packaging>=21.3 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from statsmodels->ranx==0.3.2) (21.3)
Requirement already satisfied: soupsieve>1.2 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from beautifulsoup4>=4.4.1->ir-datasets->ranx==0.3.2) (2.3.2.post1)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from packaging>=21.3->statsmodels->ranx==0.3.2) (3.0.9)
Requirement already satisfied: six in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from patsy>=0.5.2->statsmodels->ranx==0.3.2) (1.16.0)
Requirement already satisfied: idna<4,>=2.5 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from requests>=2.22.0->ir-datasets->ranx==0.3.2) (3.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from requests>=2.22.0->ir-datasets->ranx==0.3.2) (1.26.12)
Requirement already satisfied: certifi>=2017.4.17 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from requests>=2.22.0->ir-datasets->ranx==0.3.2) (2022.9.24)
Requirement already satisfied: charset-normalizer<3,>=2 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from requests>=2.22.0->ir-datasets->ranx==0.3.2) (2.1.1)
Requirement already satisfied: cbor>=1.0.0 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from trec-car-tools>=2.5.4->ir-datasets->ranx==0.3.2) (1.0.0)
Building wheels for collected packages: pyautocorpus
Building wheel for pyautocorpus (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [13 lines of output]
running bdist_wheel
running build
running build_ext
building 'pyautocorpus' extension
creating build
creating build/temp.linux-x86_64-3.10
creating build/temp.linux-x86_64-3.10/src
x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DPCRE_STATIC -I/tmp/pip-install-g3fc194j/pyautocorpus_51c04bb4f6a3404393baddede408d7df/AutoCorpus/src/common -I/tmp/pip-install-g3fc194j/pyautocorpus_51c04bb4f6a3404393baddede408d7df/AutoCorpus/src/wikipedia -I/home/celso/projects/venvs/xCoFormer/include -I/usr/include/python3.10 -c src/Textifier.cpp -o build/temp.linux-x86_64-3.10/src/Textifier.o -std=c++11
src/Textifier.cpp:40:10: fatal error: pcre.h: No such file or directory
40 | #include <pcre.h>
| ^~~~~~~~
compilation terminated.
error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for pyautocorpus
Running setup.py clean for pyautocorpus
Failed to build pyautocorpus
Installing collected packages: pyautocorpus, patsy, orjson, lz4, lxml, llvmlite, cbor2, beautifulsoup4, numba, ir-datasets, statsmodels, ranx
Running setup.py install for pyautocorpus ... error
error: subprocess-exited-with-error
× Running setup.py install for pyautocorpus did not run successfully.
│ exit code: 1
╰─> [15 lines of output]
running install
/home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
running build
running build_ext
building 'pyautocorpus' extension
creating build
creating build/temp.linux-x86_64-3.10
creating build/temp.linux-x86_64-3.10/src
x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DPCRE_STATIC -I/tmp/pip-install-g3fc194j/pyautocorpus_51c04bb4f6a3404393baddede408d7df/AutoCorpus/src/common -I/tmp/pip-install-g3fc194j/pyautocorpus_51c04bb4f6a3404393baddede408d7df/AutoCorpus/src/wikipedia -I/home/celso/projects/venvs/xCoFormer/include -I/usr/include/python3.10 -c src/Textifier.cpp -o build/temp.linux-x86_64-3.10/src/Textifier.o -std=c++11
src/Textifier.cpp:40:10: fatal error: pcre.h: No such file or directory
40 | #include <pcre.h>
| ^~~~~~~~
compilation terminated.
error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure
× Encountered error while trying to install package.
╰─> pyautocorpus
note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.
Env:
Describe the bug
When adding the newly available dcg or dcg_burges metric in the compare function, I get this error:
report = compare(
qrels=qrels,
runs=runs,
metrics=["recall@10","ndcg","rbp.90","rbp.50","dcg_burges"],
max_p=0.05, # P-value threshold
stat_test='student'
)
Traceback (most recent call last):
File "/local/home/mkp/data/gap2kic/eval/./run.py", line 32, in <module>
print(report)
File "/home/mkp/.asdf/installs/python/3.10.9/lib/python3.10/site-packages/ranx/data_structures/report.py", line 338, in __str__
return self.to_table()
File "/home/mkp/.asdf/installs/python/3.10.9/lib/python3.10/site-packages/ranx/data_structures/report.py", line 143, in to_table
label = self.get_metric_label(x)
File "/home/mkp/.asdf/installs/python/3.10.9/lib/python3.10/site-packages/ranx/data_structures/report.py", line 122, in get_metric_label
return f"{metric_labels[m]}"
KeyError: 'dcg_burges'
However, in the same file, when I do
res = evaluate(qrels, run, ["recall@10","ndcg","rbp.90","rbp.50","dcg","dcg_burges"])
everything works as intended.
Hi -- guy with the weird feature requests here 😅 --
You don’t want to ask, but I have a use case where all the documents returned by my system have the same score; however, the order matters!
And when you add_and_sort documents to a run, you end up applying sort_dict_of_dict_by_value, which might reverse or completely shuffle the order of the document ids:
In [1]: from ranx import Qrels, Run, evaluate
In [2]: run = Run()
...: run.add_multi(
...: q_ids=["q_1", "q_2"],
...: doc_ids=[
...: ["doc_12", "doc_23", "doc_25", "doc_36", "doc_32", "doc_35"],
...: ["doc_12", "doc_11", "doc_25", "doc_36", "doc_2", "doc_35"],
...: ],
...: scores=[
...: [0.9, 0.9, 0.9, 0.9, 0.9, 0.9],
...: [0.9, 0.9, 0.9, 0.9, 0.9, 0.9],
...: ],
...: )
In [3]: list(run.run['q_1'].keys())
Out[3]: ['doc_35', 'doc_32', 'doc_36', 'doc_25', 'doc_23', 'doc_12']
Obviously, my system could add a slightly negative number to preserve the order of documents; however, this is more of a pain to me than commenting out this line.
Would you be willing to add an option to disable sort_dict_of_dict_by_value when calling add_multi?
Thanks for the quick response on my other issues :)
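For completeness, the epsilon workaround dismissed above would look something like this sketch (the offset value is arbitrary, just small enough not to cross real score gaps):
doc_ids = ["doc_12", "doc_23", "doc_25", "doc_36", "doc_32", "doc_35"]
# Subtract a strictly increasing epsilon so sorting by score keeps this order
scores = [0.9 - i * 1e-9 for i in range(len(doc_ids))]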
Describe the bug
In the example below, I would expect run1 to have a precision of 1.0, and I would expect both run2 and run3 to have a precision of 0.75, as 3 out of 4 returned documents are relevant. Instead, the second run returns 0.5 and the third 0.25. Either there is a bug in handling empty query results, or I have a naive misunderstanding of precision. Also, run2 and run3 are similar, just with different queries returning null results. Please correct me if I'm wrong!
To Reproduce
Steps to reproduce the behavior:
qrels_dict = {
"q_1": {"doc_a": 1},
"q_2": {"doc_b": 1, "doc_c": 1, "doc_d": 1},
"q_3": {"doc_e": 1},
"q_4": {"doc_f": 1},
}
run_dict_1 = {
"q_1": {"doc_a": 1.0},
"q_2": {"doc_d": 1.0},
"q_3": {"doc_e": 1.0},
"q_4": {"doc_f": 1.0},
}
run_dict_2 = {
"q_1": {"doc_a": 1.0},
"q_2": {"doc_d": 1.0},
"q_3": {},
"q_4": {"doc_f": 1.0},
}
run_dict_3 = {
"q_1": {"doc_a": 1.0},
"q_2": {},
"q_3": {"doc_e": 1.0},
"q_4": {"doc_f": 1.0},
}
qrels = Qrels(qrels_dict)
run1 = Run(run_dict_1)
run2 = Run(run_dict_2)
run3 = Run(run_dict_3)
print(evaluate(qrels, run1, ["precision"]))
print(evaluate(qrels, run2, ["precision"]))
print(evaluate(qrels, run3, ["precision"]))
1.0
0.5
0.25
Is your feature request related to a problem? Please describe.
To claim "blazing-fast", it would be nice to have benchmarks against existing implementations.
Describe the solution you'd like
The implementation is benchmarked against some/all of the sources below:
The question about Recall@k arose when I looked at the best R@1 scores on the Stanford Online Products dataset on paperswithcode: https://paperswithcode.com/sota/metric-learning-on-stanford-online-products-1. This benchmark uses the R@1 metric to rank the best models and approaches for the retrieval task on the SOP dataset. The SOP dataset has 4.3 images per class (query), so the maximum R@1 score with ranx's formula would be 1 / 4.3.
In the SOP benchmark and many other benchmarks, they use the divisor min(len(relevant), k).
What do you think about making this divisor overridable? And why do papers-with-code entries write R@1 when it's actually not R@1 but HitRate@1?
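The two conventions side by side, as a sketch (the function names are mine):
def recall_at_k(retrieved, relevant, k):
    # retrieved: ranked list of doc ids; relevant: set of relevant doc ids
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)  # ranx-style: divide by all relevant items

def capped_recall_at_k(retrieved, relevant, k):
    hits = len(set(retrieved[:k]) & relevant)
    return hits / min(len(relevant), k)  # SOP-benchmark-style divisor

# With ~4.3 relevant items per query, recall_at_k at k=1 tops out around 1/4.3,
# while capped_recall_at_k at k=1 behaves like a hit rate.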
Describe the bug
Could not find a version that satisfies the requirement ranx (pip3 install ranx)
Distributor ID: Ubuntu
Description: Ubuntu 18.04.6 LTS
Release: 18.04
Codename: bionic
pip 21.3.1 from /usr/local/lib/python3.6/dist-packages/pip (python 3.6)
Error message:
ERROR: Could not find a version that satisfies the requirement ranx (from versions: none)
ERROR: No matching distribution found for ranx
Is your feature request related to a problem? Please describe.
Hi! I’d like to be able to import Report so that I can load a previously saved report (the output of compare) and tweak the runs.
Describe the solution you'd like
from .report import Report
Describe alternatives you've considered
Re-run compare with different runs 😅
Is your feature request related to a problem? Please describe.
black (https://pypi.org/project/black/) is the Uncompromising Code Formatter. You can install it, run black, and it will format the code properly. How about using it for the project to make sure the style is consistent?