unixjunkie / cpmlib

Classification and Regression Performance Metrics library

Makefile 0.68% OCaml 96.21% Python 3.11%

auc classification score-label roc bedroc enrichment-factor mcc ranking

cpmlib's Issues

shuffle_then_cut and shuffle_then_nfolds

Generic versions of those two functions could come in a Utils module.

let shuffle_then_cut seed p train_fn =
  match Utls.lines_of_file train_fn with
  | [] | [_] -> assert(false) (* no lines or header line only?! *)
  | (csv_header :: csv_payload) ->
    let rng = BatRandom.State.make [|seed|] in
    let rand_lines = L.shuffle ~state:rng csv_payload in
    let train, test = Utls.train_test_split p rand_lines in
    train_test_dump csv_header train test

let shuffle_then_nfolds seed n train_fn =
  match Utls.lines_of_file train_fn with
  | [] | [_] -> assert(false) (* no lines or header line only?! *)
  | (csv_header :: csv_payload) ->
    let rng = BatRandom.State.make [|seed|] in
    let rand_lines = L.shuffle ~state:rng csv_payload in
    let train_tests = Utls.cv_folds n rand_lines in
    L.rev_map (fun (x, y) -> train_test_dump csv_header x y) train_tests
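
A generic version might look like the sketch below. It works on any list rather than on CSV lines, and it uses only the OCaml standard library instead of Batteries. The names `shuffle` and `split_at`, the rounding of the training-set size, and the modulo-based fold assignment are all assumptions made for illustration; they are not taken from cpmlib's `Utls` module.

```ocaml
(* Hypothetical generic helpers; stdlib only, no Batteries. *)

(* Fisher-Yates shuffle of a list, driven by an explicit RNG state. *)
let shuffle rng l =
  let a = Array.of_list l in
  for i = Array.length a - 1 downto 1 do
    let j = Random.State.int rng (i + 1) in
    let tmp = a.(i) in
    a.(i) <- a.(j);
    a.(j) <- tmp
  done;
  Array.to_list a

(* Split [l] into its first [n] elements and the rest. *)
let split_at n l =
  let rec loop acc k = function
    | x :: xs when k > 0 -> loop (x :: acc) (k - 1) xs
    | rest -> (List.rev acc, rest)
  in
  loop [] n l

(* Shuffle, then keep a fraction [p] (in [0,1]) of the elements for
   training; the size is rounded to the nearest integer (assumption). *)
let shuffle_then_cut seed p l =
  let rng = Random.State.make [| seed |] in
  let shuffled = shuffle rng l in
  let n_train =
    int_of_float (Float.round (p *. float_of_int (List.length l))) in
  split_at n_train shuffled

(* Shuffle, then build [n] (train, test) pairs; fold [i] tests on the
   elements whose shuffled index is congruent to [i] modulo [n]. *)
let shuffle_then_nfolds seed n l =
  let rng = Random.State.make [| seed |] in
  let indexed = List.mapi (fun i x -> (i mod n, x)) (shuffle rng l) in
  List.init n (fun fold ->
      let test, train = List.partition (fun (k, _) -> k = fold) indexed in
      (List.map snd train, List.map snd test))
```

The CSV-specific functions above could then strip the header, call the generic version on the payload, and pass each (train, test) pair to `train_test_dump`.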

add logAUC

cf. http://wiki.bkslab.org/index.php/LogAUC

implement precision-recall AUC

we already have ROC AUC but...

add MCC for classification models
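
For reference, a minimal sketch of the Matthews correlation coefficient computed from confusion-matrix counts. The function names, the `(score, is_active)` input layout, and the convention of returning 0.0 when the denominator is zero are assumptions for illustration, not cpmlib's API.

```ocaml
(* Matthews correlation coefficient from confusion-matrix counts.
   Returns 0.0 when any marginal sum is zero (a common convention,
   assumed here). *)
let mcc ~tp ~tn ~fp ~fn =
  let tp = float_of_int tp and tn = float_of_int tn
  and fp = float_of_int fp and fn = float_of_int fn in
  let denom =
    sqrt ((tp +. fp) *. (tp +. fn) *. (tn +. fp) *. (tn +. fn)) in
  if denom = 0.0 then 0.0
  else (tp *. tn -. fp *. fn) /. denom

(* Count (tp, tn, fp, fn) from (score, is_active) pairs; a score at or
   above [threshold] is predicted active (hypothetical input layout). *)
let confusion threshold score_labels =
  List.fold_left (fun (tp, tn, fp, fn) (score, is_active) ->
      match score >= threshold, is_active with
      | true, true -> (tp + 1, tn, fp, fn)
      | true, false -> (tp, tn, fp + 1, fn)
      | false, true -> (tp, tn, fp, fn + 1)
      | false, false -> (tp, tn + 1, fp, fn))
    (0, 0, 0, 0) score_labels
```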

Lorenz curve and Gini coefficient

Can be used to compare the ranking power of methods.

re-implement Platt scaling

The current implementation is a hack that relies on gnuplot.
An OCaml implementation of Platt scaling would remove the dependency on gnuplot.

implement REC

Regression Error Characteristic Curve

Bi, J. and Bennett, K.P., 2003. Regression error characteristic curves. In Proceedings of the 20th international conference on machine learning (ICML-03) (pp. 43-50).

logAUC

$\mathrm{LogAUC}_\lambda=\frac{\sum_{i:\,x_i\ge\lambda}(\log_{10} x_{i+1}-\log_{10} x_i)\,\frac{y_{i+1}+y_i}{2}}{\log_{10}\frac{1}{\lambda}}$
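
The formula can be transcribed almost literally, as in the sketch below. It assumes the ROC curve is given as a list of (x, y) points sorted by increasing x, with x the false positive rate and y the true positive rate. The function name `log_auc` and the default lambda of 0.001 are assumptions. No interpolation is done at lambda: a segment is counted only if its left end satisfies x_i >= lambda, exactly as the sum is written.

```ocaml
(* LogAUC_lambda computed literally from the formula: trapezoids in
   log10(x) over consecutive ROC points whose left end x_i >= lambda.
   [roc] is assumed sorted by increasing x; no interpolation at lambda. *)
let log_auc ?(lambda = 0.001) roc =
  let rec loop acc = function
    | (x0, y0) :: ((x1, y1) :: _ as rest) ->
      let acc' =
        if x0 >= lambda
        then acc +. (log10 x1 -. log10 x0) *. (y1 +. y0) /. 2.0
        else acc in
      loop acc' rest
    | _ -> acc
  in
  loop 0.0 roc /. log10 (1.0 /. lambda)
```

A curve that already sits at y = 1 for every x >= lambda yields a value of 1, which is the purpose of the normalization by log10(1/lambda).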

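RIE (Robust Initial Enhancement) = S / <S>
S = sum over all actives i of exp(-rank(active_i) / a)
<S> = average of S over 1000 trials in which the molecules (actives and inactives) are randomly ordered.
a << N, where N is the total number of molecules.
For example, with a = 50, the contribution of an active almost vanishes beyond rank 300 (exp(-300/50) ≈ 0.0025).
RIE is thus a kind of early enrichment measure, similar in spirit to an enrichment factor (EF).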
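
The definition above can be sketched as follows. The input is assumed to be the list of activity labels (`true` for an active) in ranked order, best-scored molecule first, with ranks starting at 1. The function names, the optional `trials` and `seed` parameters, and the 1-based ranking are assumptions for illustration.

```ocaml
(* S = sum over actives of exp (-rank / a); ranks are 1-based (assumed). *)
let sum_exp a labels =
  let s = ref 0.0 in
  List.iteri (fun i is_active ->
      if is_active then
        s := !s +. exp (-. float_of_int (i + 1) /. a))
    labels;
  !s

(* Fisher-Yates shuffle of a list, driven by an explicit RNG state. *)
let shuffle rng l =
  let arr = Array.of_list l in
  for i = Array.length arr - 1 downto 1 do
    let j = Random.State.int rng (i + 1) in
    let tmp = arr.(i) in
    arr.(i) <- arr.(j);
    arr.(j) <- tmp
  done;
  Array.to_list arr

(* RIE = S / <S>, with <S> estimated as the mean of S over [trials]
   random orderings of the same labels. *)
let rie ?(trials = 1000) ?(seed = 0) a labels =
  let rng = Random.State.make [| seed |] in
  let s = sum_exp a labels in
  let total = ref 0.0 in
  for _trial = 1 to trials do
    total := !total +. sum_exp a (shuffle rng labels)
  done;
  s /. (!total /. float_of_int trials)
```

A ranking that places the actives first should give an RIE well above 1, and one that places them last an RIE close to 0.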