amenra / ranx
⚡️ A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍
Home Page: https://amenra.github.io/ranx
License: MIT License
I understood that, when evaluating MAP@k, relevance judgment scores equal to 0 are ignored.
In my case, I get somewhat weird behaviour.
I'm working on a balanced dataset with binary relevance and define qrels by including both documents scored 1 and documents scored 0.
While ndcg@10 gives me results of about 0.7, MAP@10 is extremely low, at about 0.10.
Can this be because the model performs poorly beyond the very first documents, or am I doing something wrong in the evaluation?
qrels = Qrels.from_df(
df=test_loaded_pdf,
q_id_col="user_id",
doc_id_col="run_session_id",
score_col="target_binary",
)
run = Run.from_df(
df=test_loaded_pdf,
q_id_col="user_id",
doc_id_col="run_session_id",
score_col="predictions",
)
evaluate(qrels, run, ["map@10", "mrr", "ndcg@10"])
Note that predictions in test_loaded_pdf is not a list of binary relevance labels; it's a list of float relevance scores.
Hi! This lib is an extremely good tool to have in one's arsenal, but I think it would be nice to have Spearman and Kendall correlation functions included in this package for evaluating rankings. Maybe not the most popular metrics, but sometimes they can come in handy.
Best regards,
Ivan Savchuk
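In the meantime, a minimal stopgap sketch using scipy.stats (kendalltau and spearmanr are existing scipy functions; aligning the two runs on shared document ids is just my assumption about the intended usage):
from scipy.stats import kendalltau, spearmanr

# Hypothetical scores from two systems over the same documents
run_a = {"d_1": 0.9, "d_2": 0.7, "d_3": 0.4, "d_4": 0.2}
run_b = {"d_1": 0.8, "d_2": 0.9, "d_3": 0.1, "d_4": 0.3}

docs = sorted(run_a)  # align both score lists on the same document ids
a = [run_a[d] for d in docs]
b = [run_b[d] for d in docs]

print(kendalltau(a, b))  # rank correlation and p-value
print(spearmanr(a, b))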
Hi,
It’d be nice to be able to save a Report’s comparisons as a JSON file.
However, since it uses frozensets as keys, it is not JSON serializable.
Maybe you could add a method in https://github.com/AmenRa/ranx/blob/master/ranx/frozenset_dict.py to convert the _map to a JSON-serializable dict, i.e. with str keys?
The str keys could be converted from the frozenset like: ', '.join(frozenset({'foo', 'bar'}))
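A sketch of that conversion (sorting the frozenset members for deterministic keys is my own addition):
def to_json_safe(_map):
    # frozenset keys -> comma-joined, sorted string keys
    return {", ".join(sorted(k)): v for k, v in _map.items()}

# e.g. {frozenset({"foo", "bar"}): 1} -> {"bar, foo": 1}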
Describe the bug
Given the formula of RBP,
\operatorname{RBP} = (1 - p) \cdot \sum_{i=1}^{d}{r_i \cdot p^{i - 1}}
where r_i is the reward/utility, RBP should support multiple relevance levels, similar to DCG: if the max relevance level is 2, the max RBP value should be 2.
It would be good to additionally report the sample variance (or standard deviation) of the per-query metric scores alongside their average.
Like ndcg@50: 11, stddev: 2.8
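In the meantime, a minimal sketch of how to get this today (assuming evaluate's return_mean parameter returns the per-query scores when set to False, as suggested by the evaluate signature quoted later on this page):
import numpy as np
from ranx import Qrels, Run, evaluate

qrels = Qrels({"q_1": {"d_1": 1}, "q_2": {"d_2": 1}})
run = Run({"q_1": {"d_1": 0.9, "d_3": 0.8}, "q_2": {"d_3": 0.9, "d_2": 0.8}})

scores = evaluate(qrels, run, "ndcg@50", return_mean=False)  # per-query scores
print(np.mean(scores), np.std(scores, ddof=1))  # mean and sample stddev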
Is your feature request related to a problem? Please describe.
First of all: This is a really nice library! It helps a lot!
My request is regarding a recall-precision graphic. When I read TREC-related papers, they very often use the interpolated precision-recall plot to visualize the performance of IR systems being compared to each other. They also use the graphs to understand which IR system yields a higher recall value, shows a higher precision value, and generally performs better.
Describe the solution you'd like
Since I love using this library, it would be great if there were a function for generating such a precision-recall plot, or just an example/notebook in the documentation for generating this plot using the ranx library.
Describe alternatives you've considered
I've already considered creating these plots myself with the functions that the library offers. However, it seems quite complex as the interpolation greatly complicates this work for me.
Additional context
Here is an example interpolated recall-precision plot:
More information can be found on:
https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html
and
https://trec.nist.gov/pubs/trec15/appendices/CE.MEASURES06.pdf (p.4)
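For reference, a minimal sketch of the standard 11-point interpolated precision-recall curve for a single ranked list (plain numpy/matplotlib, not ranx internals; the binary relevance vector is made up):
import numpy as np
import matplotlib.pyplot as plt

rels = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])  # hypothetical ranked relevance
tp = np.cumsum(rels)
precision = tp / np.arange(1, len(rels) + 1)
recall = tp / rels.sum()

levels = np.linspace(0, 1, 11)
# Interpolated precision at recall r: max precision at any recall >= r
interp = [precision[recall >= r].max() if (recall >= r).any() else 0.0 for r in levels]

plt.plot(levels, interp, drawstyle="steps-post")
plt.xlabel("Recall")
plt.ylabel("Interpolated precision")
plt.show()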
Is your feature request related to a problem? Please describe.
I think it’s quite frustrating to have to specify the format of qrels/run (TREC or JSON). I often get exceptions when I forget to specify the 'trec' format, because JSON is the default.
Describe the solution you'd like
You could infer the format from the file extension: if .json, then JSON; else TREC. I can open a PR if you’re interested :)
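The inference itself is a one-liner; a sketch of what such a helper might look like (infer_kind is a hypothetical name, not an existing ranx function):
from pathlib import Path

def infer_kind(path: str) -> str:
    # ".json" -> JSON, anything else -> TREC, as proposed above
    return "json" if Path(path).suffix == ".json" else "trec"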
I want to implement propensity-scored precision at k (PSP@k), as defined in [1].
How could I integrate this metric in ranx?
References:
[1] Zhang, J., Chang, W.C., Yu, H.F. and Dhillon, I., 2021. Fast multi-resolution transformer fine-tuning for extreme multi-label text classification. Advances in Neural Information Processing Systems, 34, pp.7267-7280.
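For reference, a standalone sketch of PSP@k following the propensity-scored precision used in the extreme multi-label literature (the function and its arguments are mine, not a ranx API; plugging it into ranx's numba-compiled metrics would require following the library's internal metric layout):
def psp_at_k(ranked_doc_ids, relevant, propensity, k):
    # relevant: set of relevant doc ids; propensity: doc_id -> p_l in (0, 1]
    # PSP@k = (1 / k) * sum over the top-k results of y_l / p_l
    return sum(1.0 / propensity[d] for d in ranked_doc_ids[:k] if d in relevant) / k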
Why is the relevance score mandatory in Qrels?
I don’t see where you use it to compute the metrics https://github.com/AmenRa/ranx/blob/master/ranx/metrics.py
Is there any way to make it optional? If not, would a filler value like 0 for all documents be appropriate?
Describe the bug
Hi, I’m having an error when using compare with stat_test="student" (no problem when using the default "fisher").
TypeError Traceback (most recent call last)
<ipython-input-3-0369b81922de> in <module>
7 metrics=["map@100", "mrr@100", "ndcg@10"],
8 stat_test="student",
----> 9 max_p=0.01 # P-value threshold
10 )
/gpfsdswork/projects/rech/fih/usl47jg/ranx/ranx/meta/compare.py in compare(qrels, runs, metrics, stat_test, n_permutations, max_p, random_seed, threads, rounding_digits, show_percentages)
100 n_permutations=n_permutations,
101 max_p=max_p,
--> 102 random_seed=random_seed,
103 )
104
/gpfsdswork/projects/rech/fih/usl47jg/ranx/ranx/statistical_tests/__init__.py in compute_statistical_significance(model_names, metric_scores, stat_test, n_permutations, max_p, random_seed)
81 n_permutations,
82 max_p,
---> 83 random_seed,
84 )
85
/gpfsdswork/projects/rech/fih/usl47jg/ranx/ranx/statistical_tests/__init__.py in _compute_statistical_significance(control_metric_scores, treatment_metric_scores, stat_test, n_permutations, max_p, random_seed)
38 elif stat_test == "student":
39 p_value, significant = paired_student_t_test(
---> 40 control_metric_scores[m], treatment_metric_scores[m], max_p,
41 )
42
/gpfsdswork/projects/rech/fih/usl47jg/ranx/ranx/statistical_tests/paired_student_t_test.py in paired_student_t_test(control, treatment, max_p)
11
12 """
---> 13 _, p_value = ttest_rel(control, treatment, alternative="two-sided")
14
15 return p_value, p_value <= max_p
TypeError: ttest_rel() got an unexpected keyword argument 'alternative'
To Reproduce
In [1]: from ranx import Qrels, Run
...:
...: qrels_dict = { "q_1": { "d_12": 5, "d_25": 3 },
...: "q_2": { "d_11": 6, "d_22": 1 } }
...:
...: run_dict = { "q_1": { "d_12": 0.9, "d_23": 0.8, "d_25": 0.7,
...: "d_36": 0.6, "d_32": 0.5, "d_35": 0.4 },
...: "q_2": { "d_12": 0.9, "d_11": 0.8, "d_25": 0.7,
...: "d_36": 0.6, "d_22": 0.5, "d_35": 0.4 } }
...:
...: qrels = Qrels(qrels_dict)
...: run = Run(run_dict)
In [2]: from ranx import compare
...:
...: # Compare different runs and perform statistical tests
...: report = compare(
...: qrels=qrels,
...: runs=[run, run],
...: metrics=["map@100", "mrr@100", "ndcg@10"],
...: stat_test="student",
...: max_p=0.01 # P-value threshold
...: )
Versions
ranx==0.2.8
@mponza found a bug when computing recall and promised to document it next week; adding a reminder here for him. Something related to multiple documents having score zero.
Thanks for your amazing work. I am very interested in this framework and am trying to use it to solve my rank aggregation problems. However, I am a little confused about the usage.
For example, I have scores for several items under different ranking rules, as follows:
item | rank1 | rank2 | rank3
item1 | 0.4 | 0.8 | 0.2
item2 | 0.8 | 0.7 | 0.7
item3 | 0.7 | 0.7 | 1.0
item4 | 0.5 | 0.6 | 0.7
Now I want a comprehensive ranking that satisfies all the subrank (e.g., rank1, rank2, rank3) constraints as much as possible:
item | rank
item1 | 0.5
item2 | 0.9
item3 | 0.7
item4 | 0.4
I'm not sure if that can be addressed with ranx. If so, could you show me an example?
Thanks a lot.
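If I understand the fusion API correctly, something like the following sketch might work (the norm/method names follow the fuse examples elsewhere on this page, and treating the whole aggregation as a single pseudo-query is my own assumption):
from ranx import Run, fuse

# One Run per ranking rule, all under a single pseudo-query "q_1"
run_1 = Run({"q_1": {"item1": 0.4, "item2": 0.8, "item3": 0.7, "item4": 0.5}})
run_2 = Run({"q_1": {"item1": 0.8, "item2": 0.7, "item3": 0.7, "item4": 0.6}})
run_3 = Run({"q_1": {"item1": 0.2, "item2": 0.7, "item3": 1.0, "item4": 0.7}})

combined = fuse(runs=[run_1, run_2, run_3], norm="min-max", method="sum")
# combined should then hold one aggregated score per item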
Hi,
@osf9018 mentioned it in #2, but I guess it’s better to create a specific issue.
It is often difficult to estimate the total number of relevant documents for a query.
For example, in Question Answering, if you have a large enough Knowledge Base, you can find the answer to your question in a surprisingly large number of documents that one cannot annotate in advance. Because of this, the relevance of a document is often estimated on the go, by checking whether the answer string is in the document retrieved by the system.
Because of this, recall is not an appropriate metric. However, one way to circumvent this is to compute recall "as if" there were only a single relevant document. After averaging over the whole dataset, it corresponds to the proportion of questions for which the system retrieved at least one relevant document in the top K. This is what @osf9018 and I call "hits@K" (I can’t remember where, but I’ve seen it in a paper), and what others, such as Karpukhin et al., call "accuracy". Accuracy is a confusing term IMO.
Would you be interested in implementing or integrating this feature in your library?
It might take some renaming, but it could be implemented very easily by using the _hits function: it is simply min(1, _hits(qrels, run, k)).
I was wondering whether ranx has a feature similar to trec_eval’s -l parameter for specifying the relevance level. If not, that would be a useful feature for evaluation.
Hi,
Thank you for this nice library.
Is there a fundamental reason why relevance scores need to be integers in Qrels?
Thanks.
Attempting to compute the MRR with your example, and I am getting a TypingError.
System:
from rank_eval import ndcg, mrr
import numpy as np
# Note that y_true does not need to be ordered
# Integers are documents IDs, while floats are the true relevance scores
y_true = np.array([[[12, 0.5], [25, 0.3]], [[11, 0.4], [2, 0.6]]])
y_pred = np.array(
[
[[12, 0.9], [234, 0.8], [25, 0.7], [36, 0.6], [32, 0.5], [35, 0.4]],
[[12, 0.9], [11, 0.8], [25, 0.7], [36, 0.6], [2, 0.5], [35, 0.4]],
]
)
k = 5
mrr(y_true, y_pred, k, threads=1)
Error message
---------------------------------------------------------------------------
TypingError Traceback (most recent call last)
<ipython-input-24-7d69c81d4a8f> in <module>
13 k = 5
14
---> 15 mrr(y_true, y_pred, k, threads=1)
~/anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py in mrr(y_true, y_pred, k, return_mean, sort, threads)
482 """
483
--> 484 return _choose_optimal_function(
485 y_true=y_true,
486 y_pred=y_pred,
~/anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py in _choose_optimal_function(y_true, y_pred, f_name, f_single, f_parallel, f_additional_args, return_mean, sort, threads)
234 if sort:
235 y_pred = _descending_sort_parallel(y_pred)
--> 236 scores = f_parallel(y_true, y_pred, **f_additional_args)
237 if return_mean:
238 return np.mean(scores)
~/anaconda3/envs/deeplearning/lib/python3.8/site-packages/numba/core/dispatcher.py in _compile_for_args(self, *args, **kws)
399 e.patch_message(msg)
400
--> 401 error_rewrite(e, 'typing')
402 except errors.UnsupportedError as e:
403 # Something unsupported is present in the user code, add help info
~/anaconda3/envs/deeplearning/lib/python3.8/site-packages/numba/core/dispatcher.py in error_rewrite(e, issue_type)
342 raise e
343 else:
--> 344 reraise(type(e), e, None)
345
346 argtypes = []
~/anaconda3/envs/deeplearning/lib/python3.8/site-packages/numba/core/utils.py in reraise(tp, value, tb)
77 value = tp()
78 if value.__traceback__ is not tb:
---> 79 raise value.with_traceback(tb)
80 raise value
81
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Failed in nopython mode pipeline (step: nopython frontend)
Failed in nopython mode pipeline (step: nopython frontend)
Invalid use of Function(<built-in function contains>) with argument(s) of type(s): (array(float64, 1d, A), float64)
* parameterized
In definition 0:
All templates rejected with literals.
In definition 1:
All templates rejected without literals.
In definition 2:
All templates rejected with literals.
In definition 3:
All templates rejected without literals.
In definition 4:
All templates rejected with literals.
In definition 5:
All templates rejected without literals.
In definition 6:
All templates rejected with literals.
In definition 7:
All templates rejected without literals.
In definition 8:
All templates rejected with literals.
In definition 9:
All templates rejected without literals.
In definition 10:
All templates rejected with literals.
In definition 11:
All templates rejected without literals.
In definition 12:
All templates rejected with literals.
In definition 13:
All templates rejected without literals.
This error is usually caused by passing an argument of a type that is unsupported by the named function.
[1] During: typing of intrinsic-call at /home/doc/anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py (78)
File "../../../anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py", line 78:
def _reciprocal_rank(y_true, y_pred, k):
<source elided>
for i in range(k):
if y_pred[i, 0] in y_true[:, 0]:
^
[1] During: resolving callee type: type(CPUDispatcher(<function _reciprocal_rank at 0x7f6697ed9dc0>))
[2] During: typing of call at /home/doc/anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py (11)
[3] During: resolving callee type: type(CPUDispatcher(<function _reciprocal_rank at 0x7f6697ed9dc0>))
[4] During: typing of call at /home/doc/anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py (11)
File "../../../anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py", line 11:
def _parallelize(f, y_true, y_pred, k):
<source elided>
for i in prange(len(y_true)):
scores[i] = f(y_true[i], y_pred[i], k)
^
[1] During: resolving callee type: type(CPUDispatcher(<function _parallelize at 0x7f6697ed4e50>))
[2] During: typing of call at /home/doc/anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py (85)
[3] During: resolving callee type: type(CPUDispatcher(<function _parallelize at 0x7f6697ed4e50>))
[4] During: typing of call at /home/doc/anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py (85)
File "../../../anaconda3/envs/deeplearning/lib/python3.8/site-packages/rank_eval/metrics.py", line 85:
def _mrr(y_true, y_pred, k):
return _parallelize(_reciprocal_rank, y_true, y_pred, k)
^
Describe the bug
I am evaluating my test set against my algorithm recommendations, but it gives the error
Qrels and Run query ids do not match
To Reproduce
Steps to reproduce the behavior:
metrics = ["recall@10", "mrr@10","ndcg@10"]
person_date_indexs = df_recs_train_top10['person_date_index'].unique()
run = tranform_recs_to_ranx(df_recs_train_top10, person_date_indexs, "person_date_index", "gym_id", "rank")
qrels = tranform_test_to_ranx(df_test_checkins_new_col_renamed, df_recs_train_top10)
evaluate(qrels, run, metrics)
Expected behavior
print the recall, mrr and ndcg
Desktop (please complete the following information):
Ubuntu 22.04
Additional context
It worked fine with another test set. The only difference is that in this test set I removed items that the user had already interacted with in train.
Describe the bug
Qrels.from_ir_datasets("msmarco-document/dev") does not seem to work; it appears to have been removed by a commit 10 days ago (e0ca82c).
Hi Elias,
Is your feature request related to a problem? Please describe.
I've noticed that Run (and I guess also Qrels) consumes a lot of memory (RAM) compared to a standard Python dict, e.g. a few GB instead of a few hundred MB. This gets problematic for somewhat large datasets (e.g. 1M queries).
Describe the solution you'd like
I guess it's related to Numba representation? I've no clue on how to make it more efficient, sorry :)
Reproduce
Just open your system monitor and see how the memory grows.
In [1]: import ranx
# this weighs only a few 100s of MB
In [2]: run_d = {str(i): {str(j): 0.0 for j in range(100)} for i in range(100000)}
# this grows to a few GB
In [3]: run_r = ranx.Run(run_d)
Best,
Paul
It’s all in the title :)
Line 140 in f41e64d
Is your feature request related to a problem? Please describe.
Instead of manually trying every possible fusion technique, it’d be nice to be able to grid-search all of them, as optimize_fusion already does for fusion hyperparameters (e.g. weights in wsum).
Describe the solution you'd like
Allow passing List[str] instead of str for the norm and method parameters.
Then do something like:
for norm in norms:
    for method in methods:
        optimize_fusion_current_behavior(*args, norm=norm, method=method, **kwargs)
The trick is to report the results in a readable manner.
Hello,
Thank you for your amazing work. I am trying to use supervised fusion methods like this:
best_params = optimize_fusion(
qrels=qrels,
runs=[run_4, run_5],
norm='min-max', # Default value
method='posfuse',
metric="mrr@10",
)
combined_run = fuse(
runs=[run_4, run_5],
norm='min-max', # Default value
method='posfuse',
params=best_params,
)
I tried changing the norm and metric to every possible value, but I still get a "Segmentation fault (core dumped)" error.
I could not find any hints in the documentation about this. Sorry if I am missing something, but can you help me with using these fusion methods?
thanks,
Describe the bug
Bug when reconstructing Qrels from a pandas DataFrame. This bug also affects reading a Qrels from a Parquet file, as the pandas-to-Qrels conversion is used internally.
Pandas version: 1.5.2
Ranx: latest pip version
To Reproduce
Code:
from ranx import Qrels
qrels = Qrels({'1':{'1':1}, '2':{'2': 1}})
df = qrels.to_dataframe()
Qrels.from_df(df)
Output:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "...\lib\site-packages\ranx\data_structures\qrels.py", line 300, in from_df
assert df[score_col].dtype == int, "DataFrame scores column dtype must be `int`"
AssertionError: DataFrame scores column dtype must be `int`
About the DataFrame dtypes
It is using int64 instead of int:
>>> df.dtypes
q_id object
doc_id object
score int64
dtype: object
>>> import numpy as np
>>> df.dtypes['score'] == np.int64
True
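A possible library-side fix, as a sketch (np.issubdtype is a real numpy call; the assertion text mirrors the one in qrels.py), would be to accept any integer dtype instead of comparing against the platform-dependent built-in int:
import numpy as np

# df is the DataFrame produced by qrels.to_dataframe() above
assert np.issubdtype(df["score"].dtype, np.integer), (
    "DataFrame scores column dtype must be an integer type"
)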
Hi,
Thank you for this nice library.
How would you go about implementing Reciprocal Rank Fusion in ranx?
Thanks,
Maxime.
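For context, what I imagine it might look like, assuming "rrf" is among the method names accepted by fuse (the call shape follows the fuse examples elsewhere on this page):
from ranx import Run, fuse

run_1 = Run({"q_1": {"d_1": 0.9, "d_2": 0.8}})  # hypothetical runs
run_2 = Run({"q_1": {"d_2": 0.9, "d_1": 0.7}})

combined_run = fuse(runs=[run_1, run_2], method="rrf")  # Reciprocal Rank Fusion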
Is your feature request related to a problem? Please describe.
I'm using Pandas dataframes and comparing different embedding models. I'd like to be able to name my runs so that, when I compare them, the report shows something other than run_1, run_2.
Describe the solution you'd like
Allow name as a named parameter in Run.from_df.
Describe alternatives you've considered
You can just set the name afterward, e.g.
my_run.name = "My Run"
Additional context
I guess this is just for consistency, since Run.from_file allows you to pass a name.
Describe the bug
I'm using the library for the first time with a Pandas dataframe and ran into an exception that was misleading.
To Reproduce
Steps to reproduce the behavior:
The id column is of type int64, e.g. df['id'] = df.index + 1
qrels = Qrels.from_df(
df=df,
q_id_col="id",
doc_id_col="best_document",
score_col="score",
)
/usr/local/lib/python3.10/dist-packages/ranx/data_structures/qrels.py in from_df(df, q_id_col, doc_id_col, score_col)
293 """
294 assert (
--> 295 df[q_id_col].dtype == "O"
296 ), "DataFrame scores column dtype must be `object` (string)"
297 assert (
AssertionError: DataFrame scores column dtype must be `object` (string)
Expected behavior
The assertion message should point to the correct column; in this case, it is the ID column that is of the wrong type. From inspecting the code, the assertion message is also wrong when the document ID column is of the wrong type.
Describe the bug
MRR@1 should be equal to Recall@1. However, these metrics diverge for the case below.
To Reproduce
%%capture
!pip install ranx
from ranx import Qrels, Run, evaluate
import pickle
# download files from https://drive.google.com/drive/folders/1ZLyB6mKKiQsypw36nhdZ4dGqmFw27K-3?usp=sharing
with open("qrels.pkl", "rb") as f:
qrels = pickle.load(f)
with open("run.pkl", "rb") as f:
run = pickle.load(f)
evaluate(
Qrels(qrels),
Run(run),
['mrr@1', 'mrr@5', 'mrr@10', 'recall@1', 'recall@5', 'recall@10'])
# {'mrr@1': 0.8133879123525163,
# 'mrr@5': 0.820242395055783,
# 'mrr@10': 0.8206332007078454,
# 'recall@1': 0.04814167526511499,
# 'recall@5': 0.05089848464127321,
# 'recall@10': 0.05171913427724859}
or use Google Colab.
Expected behavior
mrr@1 = recall@1
Am I missing something?
from ranx import Qrels
from ranx import Run
from ranx import evaluate
qrels_dict = {
"text_1": {
"label_1": 1
},
"text_2":{
"label_2": 1,
}
}
qrels = Qrels(qrels_dict, name="testing")
run_dict = {
"text_1": {
"label_1": 1,
"label_2": 0.9,
"label_3": 0.8,
"label_4": 0.7,
"label_5": 0.6,
"label_6": 0.5,
"label_7": 0.4,
"label_8": 0.3,
"label_9": 0.2,
"label_10": 0.1,
},
"text_2": {
"label_1": 0.9,
"label_2": 1,
"label_3": 0.8,
"label_4": 0.7,
"label_5": 0.6,
"label_6": 0.5,
"label_7": 0.4,
"label_8": 0.3,
"label_9": 0.2,
"label_10": 0.1,
},
}
run = Run(run_dict, name="bm25")
evaluate(qrels, run, ["mrr@1", "mrr@5", "mrr@10"])
CPU times: user 55.2 s, sys: 467 ms, total: 55.7 s
Wall time: 56.8 s
This behavior can be reproduced in this Google Colab Notebook 1 and also in this Google Colab Notebook 2 (time spent per step).
Using f1 or _f1_parallel for all qrels and run gives incorrect output, but if I use _f1 on an individual query, it gives the correct F1 score.
The two calls below return 0 for 4 cases. Ideally it should be 0 for only 1 of the 18 cases in the _qrels and _run passed.
from ranx.metrics.f1 import _f1_parallel, _f1, f1
f1(_qrels, _run, 1, 1)
# output
# array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0.])
_f1_parallel(_qrels, _run, 1 , 1)
# output
# array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0.])
But if I call _f1 for each item in qrels, I get correct F1 score for all queries.
scores = []
for i in range(len(_qrels)):  # plain range (the original snippet used numba's prange)
    try:
        scores.append(_f1(_qrels[i], _run[i], 1, 1))
    except Exception as error:
        # handle the exception
        print(f" {i} An exception occurred:", error)
        scores.append(0)
        continue
# output
# [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0, 1.0, 1.0, 1.0]
Notice, I get only 1 case where the F1 score is 0, which I think is expected for my case.
More context -
I am using the f1 metric at k=1 for a dataset where each query has only 1 relevant document. There are 18 unique qrels.
Since each query has only 1 relevant document, the score for mrr@1 = recall@1 = 0.9444; precision@1 = 0.944 as well in my case.
Here's the output of my evaluation -
{'mrr@1': 0.9444444444444444, 'mrr@2': 0.9444444444444444, 'recall@1': 0.9444444444444444, 'recall@2': 0.9444444444444444, 'precision@1': 0.9444444444444444, 'f1@1': 0.7777777777777778}
f1@1 seems very low considering that precision and recall @1 are equal and high.
Here's the pretty-printed output of scores for each individual query.
I've manually validated the output for all the metrics, and all of them seem correct to me except for f1@1.
Notice that the hits score is 0 only for q_id '6', so an f1 score of 0 for q_id '6' is expected, but the f1 score is also 0 for q_ids '7', '8' and '9'.
{
"mrr": {
"0": 1.0,
"1": 1.0,
"10": 1.0,
"11": 1.0,
"12": 1.0,
"13": 1.0,
"14": 1.0,
"15": 1.0,
"16": 1.0,
"17": 1.0,
"2": 1.0,
"3": 1.0,
"4": 1.0,
"5": 1.0,
"6": 0.3333333333333333,
"7": 1.0,
"8": 1.0,
"9": 1.0
},
"mrr@1": {
"0": 1.0,
"1": 1.0,
"10": 1.0,
"11": 1.0,
"12": 1.0,
"13": 1.0,
"14": 1.0,
"15": 1.0,
"16": 1.0,
"17": 1.0,
"2": 1.0,
"3": 1.0,
"4": 1.0,
"5": 1.0,
"6": 0.0,
"7": 1.0,
"8": 1.0,
"9": 1.0
},
"recall@1": {
"0": 1.0,
"1": 1.0,
"10": 1.0,
"11": 1.0,
"12": 1.0,
"13": 1.0,
"14": 1.0,
"15": 1.0,
"16": 1.0,
"17": 1.0,
"2": 1.0,
"3": 1.0,
"4": 1.0,
"5": 1.0,
"6": 0.0,
"7": 1.0,
"8": 1.0,
"9": 1.0
},
"precision@1": {
"0": 1.0,
"1": 1.0,
"10": 1.0,
"11": 1.0,
"12": 1.0,
"13": 1.0,
"14": 1.0,
"15": 1.0,
"16": 1.0,
"17": 1.0,
"2": 1.0,
"3": 1.0,
"4": 1.0,
"5": 1.0,
"6": 0.0,
"7": 1.0,
"8": 1.0,
"9": 1.0
},
"f1@1": {
"0": 1.0,
"1": 1.0,
"10": 1.0,
"11": 1.0,
"12": 1.0,
"13": 1.0,
"14": 1.0,
"15": 1.0,
"16": 1.0,
"17": 1.0,
"2": 1.0,
"3": 1.0,
"4": 1.0,
"5": 1.0,
"6": 0.0,
"7": 0.0,
"8": 0.0,
"9": 0.0
},
"hits@1": {
"0": 1.0,
"1": 1.0,
"10": 1.0,
"11": 1.0,
"12": 1.0,
"13": 1.0,
"14": 1.0,
"15": 1.0,
"16": 1.0,
"17": 1.0,
"2": 1.0,
"3": 1.0,
"4": 1.0,
"5": 1.0,
"6": 0.0,
"7": 1.0,
"8": 1.0,
"9": 1.0
}
}
Thanks in advance.
Let me know if I am missing something.
Missing results cause AssertionError: Qrels and Run query ids do not match
You should make check_keys optional in evaluate, because sometimes queries do not return any results for lexical-based systems.
Line 128 in da0aa52
Is your feature request related to a problem? Please describe.
trec files can be several megabytes; for example, the run_*.trec files used for the examples are all more than 20 MB, but once compressed they become less than 10 MB. That would make downloads faster and also speed up loading the files into memory.
Describe the solution you'd like
Support *.trec.gz.
Describe alternatives you've considered
It would be cool to also evaluate alternative formats for storing the trec file, like [Parquet](https://arrow.apache.org/docs/python/parquet.html). This library focuses on computing metrics fast, but if you spend ages loading/parsing the trec file, that speed is not very useful; Parquet is much faster to load into memory and supports compression natively.
Additional context
💨
Hi,
compare does not have a rounding_digits argument and thus always uses the default from Report (which is 4). Why is that?
Also, would you like to add an option in Report to print results as percentages rather than ratios?
Hi,
I tested your code and found it easy to use and integrate. Moreover, the results I got are fully coherent with those I previously obtained with a personal implementation of trec_eval, and the computation of the measures is fast. This is clearly an interesting piece of software, and its presentation at the demo session of ECIR 2022 is a good thing.
I had only one problem, with the R-precision measure. The main issue is that if you replace "ndcg@5" with "r-precision" in the 4th cell of the overview.ipynb notebook, you get:
TypeError Traceback (most recent call last)
/tmp/ipykernel_28676/2318072837.py in
1 # Compute NDCG@5
----> 2 evaluate(qrels, run, "r-precision")
/vol/data/ferret/tools-distrib/_research_code/rank_eval/rank_eval/meta_functions.py in evaluate(qrels, run, metrics, return_mean, threads, save_results_in_run)
149 for m, scores in metric_scores_dict.items():
150 for i, q_id in enumerate(run.get_query_ids()):
--> 151 run.scores[m][q_id] = scores[i]
152 # Prepare output -----------------------------------------------------------
153 if return_mean:
TypeError: 'numpy.float64' object does not support item assignment
I first detected the problem through the integration of your code and obtained the same error. Looking at the file meta_functions.py, where the problem arises:
143 if type(run) == Run and save_results_in_run:
144 for m, scores in metric_scores_dict.items():
145 if m not in ["r_precision", "r-precision"]:
146 run.mean_scores[m] = np.mean(scores)
147 else:
148 run.scores[m] = np.mean(scores)
149 for m, scores in metric_scores_dict.items():
150 for i, q_id in enumerate(run.get_query_ids()):
151 run.scores[m][q_id] = scores[i]
I saw your recent update of this part of the code, but there is still a problem: for R-precision, the mean of the scores is stored in run.scores and not in run.mean_scores. As a consequence, using run.scores for storing the score of each query raises a problem if both the return_mean and save_results_in_run flags are set to True. More globally, I am not sure I understand why you treat R-precision differently from the other measures when computing the mean score.
Thank you in advance for your efforts in fixing the issue.
Olivier
https://github.com/AmenRa/rank_eval/blob/master/rank_eval/meta_functions.py#L220
control_metric_scores could be defined outside the for j loop :)
Describe the bug
When converting a Report to a dict, you only keep one metric while iterating over metrics (overwriting the previous metric in each loop):
https://github.com/AmenRa/ranx/blob/master/ranx/report.py#L315
How to fix
Replace the line above with:
d[m1]["win_tie_loss"][m2][metric] = self.win_tie_loss[(m1, m2)][metric]
and initialize d[m1]["win_tie_loss"][m2] = {} at the same place as https://github.com/AmenRa/ranx/blob/master/ranx/report.py#L309 (just above).
First of all, great work on this code. I have been looking for a definitive package to evaluate ranking models and I believe this is that package.
My question is perhaps a bit outside the domain, but it could help others in the future. How would you deal with comparing two models where each has multiple runs (e.g., runs with different random initializations and/or batch shuffling, for confidence intervals)? I was thinking that perhaps the significance testing could be performed between the mean (across runs) metric_scores vectors.
Thanks in advance,
Tiago
Is your feature request related to a problem? Please describe.
Hi, you’ve done a great job implementing plenty of different fusion algorithms, but I think a fixed built-in set will always be a bottleneck.
What would you think about letting the user define their own training function?
Describe the solution you'd like
For example, in optimize_fusion, allow method to be a callable and, in this case, do not call has_hyperparams and optimization_switch.
Describe alternatives you've considered
My use case / Ma et al.
By the way, at the moment, my use case is the default-minimum trick of Ma et al.: when combining results from systems A and B, it consists of giving a document the minimum score of A's results if it was only retrieved by system B, and vice versa.
Maybe this is already possible in ranx via some option/method named differently? Or maybe you’d like to add it to the core ranx fusion algorithms?
Hi,
Just a quick question: I wondered what motivated the change of the stat test's default value to student instead of fisher in 0dc8d9c (I almost published results as-is before figuring it out 😅).
I thought one of your documentation pages pointed me to the paper by Smucker et al. that suggests using Fisher (and especially not Student), but maybe I don’t recall correctly.
Btw, the docstring still shows 'fisher' as the default: https://amenra.github.io/ranx/compare/
Hello,
I'd like to determine what query is causing the following error and how to get around it:
Traceback (most recent call last):
File "main.py", line 43, in perform_tasks
eval(params)
File "main.py", line 25, in eval
eval_helper.perform_eval()
File "/home/celso/projects/XMTC-Baselines/source/helper/EvalHelper.py", line 62, in perform_eval
qrels = Qrels(filtered_relevance_map)
File "/home/celso/projects/venvs/XMTC-Baselines/lib/python3.8/site-packages/ranx/data_structures/qrels.py", line 62, in __init__
max_len = max(len(y) for x in doc_ids for y in x)
ValueError: max() arg is an empty sequence
My evaluation code is shown in the code snippet below.
ranking = self._retrieve(...)
filtered_relevance_map = {key: value for key, value in self.relevance_map.items() if key in ranking.keys()}
qrels = Qrels(filtered_relevance_map)
run = Run(ranking, name=cls)
result = evaluate(qrels, run, self.metrics, threads=12)
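The failing line, max(len(y) for x in doc_ids for y in x), suggests that the filtered qrels end up with no (query, document) pairs at all. A plain-Python diagnostic sketch, continuing the snippet above, to list and drop queries without judged documents:
empty = [q for q, docs in filtered_relevance_map.items() if not docs]
print(empty)  # queries with no judged documents

filtered_relevance_map = {q: docs for q, docs in filtered_relevance_map.items() if docs}
qrels = Qrels(filtered_relevance_map)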
In industry, DCG (in both formulations) is a standard and widely used metric. I see it is already implemented as part of NDCG. Would it be possible to expose it to users as a real metric?
Describe the bug
Error during pip install.
To Reproduce
pip install ranx==0.3.2
Bash output
(xCoFormer) celso@capri:~/projects/xCoFormer$ pip install ranx==0.3.2
Collecting ranx==0.3.2
Using cached ranx-0.3.2-py3-none-any.whl (93 kB)
Requirement already satisfied: numpy in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ranx==0.3.2) (1.22.4)
Requirement already satisfied: tqdm in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ranx==0.3.2) (4.64.1)
Requirement already satisfied: scipy>=1.6.0 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ranx==0.3.2) (1.9.3)
Collecting lz4
Using cached lz4-4.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting cbor2
Using cached cbor2-5.4.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (224 kB)
Collecting ir-datasets
Using cached ir_datasets-0.5.4-py3-none-any.whl (311 kB)
Collecting statsmodels
Using cached statsmodels-0.13.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.0 MB)
Requirement already satisfied: pandas in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ranx==0.3.2) (1.4.4)
Requirement already satisfied: rich in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ranx==0.3.2) (12.6.0)
Collecting orjson
Using cached orjson-3.8.0-cp310-cp310-manylinux_2_28_x86_64.whl (146 kB)
Requirement already satisfied: tabulate in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ranx==0.3.2) (0.9.0)
Collecting numba>=0.54.1
Using cached numba-0.56.3-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (3.5 MB)
Collecting llvmlite<0.40,>=0.39.0dev0
Using cached llvmlite-0.39.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.6 MB)
Requirement already satisfied: setuptools in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from numba>=0.54.1->ranx==0.3.2) (59.6.0)
Requirement already satisfied: zlib-state>=0.1.3 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ir-datasets->ranx==0.3.2) (0.1.5)
Collecting lxml>=4.5.2
Using cached lxml-4.9.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (6.9 MB)
Requirement already satisfied: ijson>=3.1.3 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ir-datasets->ranx==0.3.2) (3.1.4)
Requirement already satisfied: trec-car-tools>=2.5.4 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ir-datasets->ranx==0.3.2) (2.6)
Requirement already satisfied: requests>=2.22.0 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ir-datasets->ranx==0.3.2) (2.28.1)
Requirement already satisfied: pyyaml>=5.3.1 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ir-datasets->ranx==0.3.2) (6.0)
Collecting pyautocorpus>=0.1.1
Using cached pyautocorpus-0.1.8.tar.gz (10 kB)
Preparing metadata (setup.py) ... done
Requirement already satisfied: unlzw3>=0.2.1 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ir-datasets->ranx==0.3.2) (0.2.1)
Collecting beautifulsoup4>=4.4.1
Using cached beautifulsoup4-4.11.1-py3-none-any.whl (128 kB)
Requirement already satisfied: warc3-wet>=0.2.3 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ir-datasets->ranx==0.3.2) (0.2.3)
Requirement already satisfied: warc3-wet-clueweb09>=0.2.5 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from ir-datasets->ranx==0.3.2) (0.2.5)
Requirement already satisfied: python-dateutil>=2.8.1 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from pandas->ranx==0.3.2) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from pandas->ranx==0.3.2) (2022.5)
Requirement already satisfied: pygments<3.0.0,>=2.6.0 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from rich->ranx==0.3.2) (2.13.0)
Requirement already satisfied: commonmark<0.10.0,>=0.9.0 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from rich->ranx==0.3.2) (0.9.1)
Collecting patsy>=0.5.2
Using cached patsy-0.5.3-py2.py3-none-any.whl (233 kB)
Requirement already satisfied: packaging>=21.3 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from statsmodels->ranx==0.3.2) (21.3)
Requirement already satisfied: soupsieve>1.2 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from beautifulsoup4>=4.4.1->ir-datasets->ranx==0.3.2) (2.3.2.post1)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from packaging>=21.3->statsmodels->ranx==0.3.2) (3.0.9)
Requirement already satisfied: six in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from patsy>=0.5.2->statsmodels->ranx==0.3.2) (1.16.0)
Requirement already satisfied: idna<4,>=2.5 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from requests>=2.22.0->ir-datasets->ranx==0.3.2) (3.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from requests>=2.22.0->ir-datasets->ranx==0.3.2) (1.26.12)
Requirement already satisfied: certifi>=2017.4.17 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from requests>=2.22.0->ir-datasets->ranx==0.3.2) (2022.9.24)
Requirement already satisfied: charset-normalizer<3,>=2 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from requests>=2.22.0->ir-datasets->ranx==0.3.2) (2.1.1)
Requirement already satisfied: cbor>=1.0.0 in /home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages (from trec-car-tools>=2.5.4->ir-datasets->ranx==0.3.2) (1.0.0)
Building wheels for collected packages: pyautocorpus
Building wheel for pyautocorpus (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [13 lines of output]
running bdist_wheel
running build
running build_ext
building 'pyautocorpus' extension
creating build
creating build/temp.linux-x86_64-3.10
creating build/temp.linux-x86_64-3.10/src
x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DPCRE_STATIC -I/tmp/pip-install-g3fc194j/pyautocorpus_51c04bb4f6a3404393baddede408d7df/AutoCorpus/src/common -I/tmp/pip-install-g3fc194j/pyautocorpus_51c04bb4f6a3404393baddede408d7df/AutoCorpus/src/wikipedia -I/home/celso/projects/venvs/xCoFormer/include -I/usr/include/python3.10 -c src/Textifier.cpp -o build/temp.linux-x86_64-3.10/src/Textifier.o -std=c++11
src/Textifier.cpp:40:10: fatal error: pcre.h: No such file or directory
40 | #include <pcre.h>
| ^~~~~~~~
compilation terminated.
error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for pyautocorpus
Running setup.py clean for pyautocorpus
Failed to build pyautocorpus
Installing collected packages: pyautocorpus, patsy, orjson, lz4, lxml, llvmlite, cbor2, beautifulsoup4, numba, ir-datasets, statsmodels, ranx
Running setup.py install for pyautocorpus ... error
error: subprocess-exited-with-error
× Running setup.py install for pyautocorpus did not run successfully.
│ exit code: 1
╰─> [15 lines of output]
running install
/home/celso/projects/venvs/xCoFormer/lib/python3.10/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
running build
running build_ext
building 'pyautocorpus' extension
creating build
creating build/temp.linux-x86_64-3.10
creating build/temp.linux-x86_64-3.10/src
x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DPCRE_STATIC -I/tmp/pip-install-g3fc194j/pyautocorpus_51c04bb4f6a3404393baddede408d7df/AutoCorpus/src/common -I/tmp/pip-install-g3fc194j/pyautocorpus_51c04bb4f6a3404393baddede408d7df/AutoCorpus/src/wikipedia -I/home/celso/projects/venvs/xCoFormer/include -I/usr/include/python3.10 -c src/Textifier.cpp -o build/temp.linux-x86_64-3.10/src/Textifier.o -std=c++11
src/Textifier.cpp:40:10: fatal error: pcre.h: No such file or directory
40 | #include <pcre.h>
| ^~~~~~~~
compilation terminated.
error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure
× Encountered error while trying to install package.
╰─> pyautocorpus
note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.
Env:
Describe the bug
When adding the newly available dcg or dcg_burges metric in the compare function, I get this error:
report = compare(
qrels=qrels,
runs=runs,
metrics=["recall@10","ndcg","rbp.90","rbp.50","dcg_burges"],
max_p=0.05, # P-value threshold
stat_test='student'
)
Traceback (most recent call last):
File "/local/home/mkp/data/gap2kic/eval/./run.py", line 32, in <module>
print(report)
File "/home/mkp/.asdf/installs/python/3.10.9/lib/python3.10/site-packages/ranx/data_structures/report.py", line 338, in __str__
return self.to_table()
File "/home/mkp/.asdf/installs/python/3.10.9/lib/python3.10/site-packages/ranx/data_structures/report.py", line 143, in to_table
label = self.get_metric_label(x)
File "/home/mkp/.asdf/installs/python/3.10.9/lib/python3.10/site-packages/ranx/data_structures/report.py", line 122, in get_metric_label
return f"{metric_labels[m]}"
KeyError: 'dcg_burges'
However, in the same file, when I do
res = evaluate(qrels, run, ["recall@10","ndcg","rbp.90","rbp.50","dcg","dcg_burges"])
everything works as intended.
Hi -- guy with the weird feature requests here 😅 --
You don’t want to ask, but I have a use case where all the documents returned by my system have the same score; however, the order matters!
And when you add_and_sort documents to a run, you end up applying sort_dict_of_dict_by_value, which might reverse or completely shuffle the order of the document ids:
In [1]: from ranx import Qrels, Run, evaluate
In [2]: run = Run()
...: run.add_multi(
...: q_ids=["q_1", "q_2"],
...: doc_ids=[
...: ["doc_12", "doc_23", "doc_25", "doc_36", "doc_32", "doc_35"],
...: ["doc_12", "doc_11", "doc_25", "doc_36", "doc_2", "doc_35"],
...: ],
...: scores=[
...: [0.9, 0.9, 0.9, 0.9, 0.9, 0.9],
...: [0.9, 0.9, 0.9, 0.9, 0.9, 0.9],
...: ],
...: )
In [3]: list(run.run['q_1'].keys())
Out[3]: ['doc_35', 'doc_32', 'doc_36', 'doc_25', 'doc_23', 'doc_12']
Obviously, my system could add a slightly negative number to preserve the order of documents; however, this is more of a pain to me than commenting out this line.
Would you be willing to add an option to disable sort_dict_of_dict_by_value when calling add_multi?
Thanks for the quick response on my other issues :)
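For completeness, the epsilon workaround dismissed above would look something like this sketch (the offset value is arbitrary, just small enough not to cross real score gaps):
doc_ids = ["doc_12", "doc_23", "doc_25", "doc_36", "doc_32", "doc_35"]
# Subtract a strictly increasing epsilon so sorting by score keeps this order
scores = [0.9 - i * 1e-9 for i in range(len(doc_ids))]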
Describe the bug
In the example below, I would expect run1 to have a precision of 1.0, and I would expect both run2 and run3 to have a precision of 0.75, as 3 out of 4 returned documents are relevant. Instead, the second run returns 0.5 and the third 0.25. Either there is a bug in handling empty query results, or I have a naive misunderstanding of precision. Also, run2 and run3 are similar, just with different queries returning null results. Please correct me if I'm wrong!
To Reproduce
Steps to reproduce the behavior:
qrels_dict = {
"q_1": {"doc_a": 1},
"q_2": {"doc_b": 1, "doc_c": 1, "doc_d": 1},
"q_3": {"doc_e": 1},
"q_4": {"doc_f": 1},
}
run_dict_1 = {
"q_1": {"doc_a": 1.0},
"q_2": {"doc_d": 1.0},
"q_3": {"doc_e": 1.0},
"q_4": {"doc_f": 1.0},
}
run_dict_2 = {
"q_1": {"doc_a": 1.0},
"q_2": {"doc_d": 1.0},
"q_3": {},
"q_4": {"doc_f": 1.0},
}
run_dict_3 = {
"q_1": {"doc_a": 1.0},
"q_2": {},
"q_3": {"doc_e": 1.0},
"q_4": {"doc_f": 1.0},
}
qrels = Qrels(qrels_dict)
run1 = Run(run_dict_1)
run2 = Run(run_dict_2)
run3 = Run(run_dict_3)
print(evaluate(qrels, run1, ["precision"]))
print(evaluate(qrels, run2, ["precision"]))
print(evaluate(qrels, run3, ["precision"]))
1.0
0.5
0.25
Is your feature request related to a problem? Please describe.
To claim "blazing-fast", it would be nice to have benchmarks against existing implementations.
Describe the solution you'd like
The implementation is benchmarked against some/all of the sources below:
The question about Recall@k arose when I looked at the best R@1 scores on the Stanford Online Products dataset on paperswithcode: https://paperswithcode.com/sota/metric-learning-on-stanford-online-products-1. This benchmark uses the R@1 metric to rank the best models and approaches for the retrieval task on the SOP dataset. The SOP dataset has 4.3 images per class (query), so the maximum R@1 score with ranx's formula would be 1 / 4.3.
In the SOP benchmark and many other benchmarks, they use the divisor min(len(relevant), k).
What do you think about making this divisor overridable? And why do papers-with-code entries write R@1 when it's actually not R@1 but HitRate@1?
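The two conventions side by side, as a sketch (the function names are mine):
def recall_at_k(retrieved, relevant, k):
    # retrieved: ranked list of doc ids; relevant: set of relevant doc ids
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)  # ranx-style: divide by all relevant items

def capped_recall_at_k(retrieved, relevant, k):
    hits = len(set(retrieved[:k]) & relevant)
    return hits / min(len(relevant), k)  # SOP-benchmark-style divisor

# With ~4.3 relevant items per query, recall_at_k at k=1 tops out around 1/4.3,
# while capped_recall_at_k at k=1 behaves like a hit rate.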
Describe the bug
Could not find a version that satisfies the requirement ranx (pip3 install ranx)
Distributor ID: Ubuntu
Description: Ubuntu 18.04.6 LTS
Release: 18.04
Codename: bionic
pip 21.3.1 from /usr/local/lib/python3.6/dist-packages/pip (python 3.6)
Error message:
ERROR: Could not find a version that satisfies the requirement ranx (from versions: none)
ERROR: No matching distribution found for ranx
Is your feature request related to a problem? Please describe.
Hi! I’d like to be able to import Report so that I can load a previously saved report (the output of compare) and tweak the runs.
Describe the solution you'd like
from .report import Report
Describe alternatives you've considered
Re-run compare with different runs 😅
Is your feature request related to a problem? Please describe.
black (https://pypi.org/project/black/) is the Uncompromising Code Formatter. You can install it, run black, and it will format the code properly. How about using it for the project to make sure the style is consistent?