privong / pymccorrelation
Correlation coefficients with uncertainties
License: GNU General Public License v3.0
Protect against the user entering limit flags other than -1, 0, or 1, and throw an informative error.
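A minimal validation sketch (the helper name `validate_limits` and its placement are assumptions, not the library's API):

```python
import numpy as np

def validate_limits(lim):
    """Hypothetical helper: reject limit flags outside {-1, 0, 1}."""
    lim = np.asarray(lim)
    if not np.isin(lim, (-1, 0, 1)).all():
        raise ValueError("Limit flags must be -1 (lower limit), "
                         "0 (detection), or 1 (upper limit).")
```

Such a check could run once at the top of the censored-data code path, alongside the existing length checks.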
The implementation of Kendall's tau for censored data is slow, likely due in part to the nested for loops. The code should be examined to see whether it can be sped up (by vectorization or other approaches).
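For the uncensored part of the calculation, the nested i<j loops can be replaced with pairwise sign matrices; a sketch (function name illustrative, and a censored generalization would additionally need to mask pairs by their limit flags):

```python
import numpy as np

def tau_numerator(x, y):
    """Concordant-minus-discordant pair count, vectorized (no ties/censoring)."""
    x, y = np.asarray(x), np.asarray(y)
    sx = np.sign(x[:, None] - x[None, :])  # pairwise sign(x_i - x_j)
    sy = np.sign(y[:, None] - y[None, :])
    iu = np.triu_indices(len(x), k=1)      # count each i<j pair once
    return int(np.sum(sx[iu] * sy[iu]))
```

This trades Python-level loop overhead for O(n^2) memory in NumPy, which is usually a large speedup at the sample sizes involved.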
Hey, just a minor misspelling on the README both here and on the pypi page that could give some trouble in the future, so I thought I could give you a heads up.
On the example to compute the Pearson's r for a sample, using 1000 bootstrapping iterations, you have the formula:
res = pymccorrelation(data['x'], data['y]',
coeff='pearsonr',
Nboot=1000)
The data for the y variable is typed as data['y]', with one of the quotes on the wrong side of the bracket; it should be data['y']. If copied and pasted directly, this simple error can cause some trouble until it is fixed.
That's all, thanks for the package, it really helped me.
As of now, the library doesn't allow non-symmetric uncertainties, or values at a boundary (for which the uncertainty is one-sided).
These cases are extremely common in, e.g., astrophysics fitting results. I've done it myself locally and it's a very minor change. I think a proper implementation, up to the standards of such an important library, would be very beneficial to the community.
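One lightweight way to handle asymmetric error bars during the perturbation step is a split-normal draw: sample a standard normal deviate and scale it by the upper or lower uncertainty depending on its sign. A sketch under that assumption (the parameter names `dy_lo`/`dy_hi` are illustrative, not the library's API):

```python
import numpy as np

def perturb_asymmetric(y, dy_lo, dy_hi, rng=None):
    """Perturb y with split-normal (asymmetric) 1-sigma errors."""
    rng = np.random.default_rng() if rng is None else rng
    y, dy_lo, dy_hi = (np.asarray(a, float) for a in (y, dy_lo, dy_hi))
    z = rng.standard_normal(len(y))
    sigma = np.where(z >= 0, dy_hi, dy_lo)  # upper error above, lower below
    return y + z * sigma
```

One-sided (boundary) values could be handled the same way by setting the uncertainty on the forbidden side to zero.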
Instead of separate pymcspearman() and pymckendalltau() functions, provide a single pymccorrelation() function where the user can specify the correlation coefficient to calculate.
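A minimal sketch of how such a single entry point could dispatch (only a Pearson backend is shown; the function name and dict layout are illustrative, not the final API):

```python
import numpy as np

def _pearsonr(x, y):
    """Pearson r via the normalized covariance matrix."""
    return np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1]

def pymccorrelation_sketch(x, y, coeff='pearsonr'):
    """Dispatch on the requested coefficient; bootstrap/perturbation
    handling would wrap whichever backend is selected."""
    funcs = {'pearsonr': _pearsonr}  # 'spearmanr', 'kendallt' would join here
    if coeff not in funcs:
        raise ValueError(f"unknown correlation coefficient: {coeff!r}")
    return funcs[coeff](x, y)
```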
Need to update the code to do multiple iterations if multiple point perturbations are requested. Also, re-cast the argument as the number of perturbations per bootstrap.
Apply the Monte Carlo (bootstrapping and perturbation) methods to statistical tests comparing data with empirical or parametric probability distributions.
ALGORITHM AS R94 APPL. STATIST. (1995) VOL. 44, NO. 4.
Implement the Pearson correlation coefficient, after finishing the single-interface change (#2).
If the sample size is small (or the number of bootstraps is large), the correlation coefficients can be undefined and return nan values. The use of np.percentile() then returns nan from pymccorrelation(). If there are many nan values, this probably suggests the bootstrapping is not well converged. When looking at the mock dataset to check recovery (#4), the convergence of the bootstrapping would be good to consider.
Ultimately, decide whether nanpercentile() should be used, optionally with a warning if the size of the dataset is too small for reliable bootstrap error estimation.
There is probably statistics literature about this too...
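A sketch of the nanpercentile option with the suggested warning (the function name and the 1-sigma percentile choices are assumptions):

```python
import warnings
import numpy as np

def bootstrap_interval(coeffs, q=(15.87, 50.0, 84.13)):
    """Summarize bootstrap coefficients while tolerating NaN iterations."""
    coeffs = np.asarray(coeffs, float)
    nan_frac = np.isnan(coeffs).mean()
    if nan_frac > 0:
        warnings.warn(f"{nan_frac:.1%} of bootstrap iterations returned NaN; "
                      "the dataset may be too small for reliable error estimation.")
    return np.nanpercentile(coeffs, q)
```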
In functions such as pymcspearman(), having bootstrap=True and Nboot=10000 is redundant. Eliminate the flag argument and set Nboot=None by default. Non-None values will enable bootstrapping or perturbation.
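A sketch of the proposed signature (the rank-based Spearman backend here ignores ties and is only for illustration):

```python
import numpy as np

def _spearman_rho(x, y):
    """Spearman rho as the Pearson r of the ranks (no tie correction)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

def pymcspearman_sketch(x, y, Nboot=None, rng=None):
    """Nboot=None (the proposed default) skips bootstrapping entirely."""
    x, y = np.asarray(x), np.asarray(y)
    if Nboot is None:
        return _spearman_rho(x, y)
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.integers(0, len(x), size=(Nboot, len(x)))
    return np.array([_spearman_rho(x[i], y[i]) for i in idx])
```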
Since v1.17, numpy has a new procedure for generating pseudorandom sequences. Migrate the code to this:
https://numpy.org/doc/stable/reference/random/generator.html#numpy.random.default_rng
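The migration amounts to replacing module-level np.random.* calls with a Generator instance; for example:

```python
import numpy as np

# Legacy: np.random.seed(42); np.random.normal(...); np.random.randint(...)
# New Generator API (NumPy >= 1.17):
rng = np.random.default_rng(42)
draws = rng.normal(size=5)           # e.g. perturbation draws
idx = rng.integers(0, 10, size=10)   # e.g. bootstrap resampling indices
```

Passing the Generator (or a seed) down through the public functions would also make runs reproducible.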
From Ilsang:
I am really curious whether the output correlation coefficient recovers the intrinsic correlation coefficient within a certain confidence interval. The correlation coefficient is defined as cov(X,Y)/sqrt(var(X)*var(Y)), so you can easily generate random data points with a given covariance matrix and know what the input correlation coefficient is. If you simulate measuring the correlation coefficient of such datasets (using your program) with different sample sizes and measurement errors, how reliably can you recover the input correlation coefficient?
https://scipy-cookbook.readthedocs.io/items/CorrelatedRandomSamples.html
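Following the cookbook recipe, a recovery test can generate pairs with a known population correlation and check how well it is re-measured (a sketch; the sample size and seed are arbitrary choices):

```python
import numpy as np

def correlated_pairs(rho, n, rng=None):
    """Draw (x, y) from a bivariate normal with unit variances,
    so the population Pearson correlation is exactly rho."""
    rng = np.random.default_rng() if rng is None else rng
    cov = [[1.0, rho], [rho, 1.0]]
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    return x, y

x, y = correlated_pairs(0.8, 100_000, rng=np.random.default_rng(1))
r_hat = np.corrcoef(x, y)[0, 1]  # should land close to 0.8
```

Repeating this over a grid of sample sizes and added measurement errors would answer the recovery question quantitatively.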
For computing Kendall's tau with censored data, the bootstrapping code does not also resample the limit flags when resampling the data points.
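The fix is to draw the bootstrap indices once and apply them to the data and the flags together, so each resampled point keeps its own censoring flag; a sketch (function and parameter names illustrative):

```python
import numpy as np

def resample_with_limits(x, y, xlim, ylim, rng=None):
    """Jointly resample data points and their censoring flags."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.integers(0, len(x), size=len(x))  # one index draw for everything
    return tuple(np.asarray(a)[idx] for a in (x, y, xlim, ylim))
```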
If the number of bootstraps requested is larger than the number of possible permutations, duplicate coefficient/p-value results will be returned. This will lead to over-representation of that value in the probability distribution.
This check can be done at the start, when data lengths are checked for consistency.
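For a sample of n points there are C(2n-1, n) distinct bootstrap resamples (multisets of size n drawn from n items), so the check could look like this sketch (the function name and warning wording are assumptions):

```python
import warnings
from math import comb

def check_nboot(n, Nboot):
    """Warn when Nboot exceeds the number of distinct bootstrap resamples."""
    distinct = comb(2 * n - 1, n)
    if Nboot > distinct:
        warnings.warn(f"Nboot={Nboot} exceeds the {distinct} distinct "
                      f"resamples possible for n={n}; duplicates are inevitable.")
```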