privong / pymccorrelation
Correlation coefficients with uncertainties
License: GNU General Public License v3.0
Protect against the user entering limit flags other than -1, 0, or 1, and throw an informative error.
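A minimal validation sketch (the helper name `validate_limits` and its placement are assumptions, not the library's API):

```python
import numpy as np

def validate_limits(lim):
    """Hypothetical helper: reject limit flags outside {-1, 0, 1}."""
    lim = np.asarray(lim)
    if not np.isin(lim, (-1, 0, 1)).all():
        raise ValueError("Limit flags must be -1 (lower limit), "
                         "0 (detection), or 1 (upper limit).")
```

Such a check could run once at the top of the censored-data code path, alongside the existing length checks.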
The implementation of Kendall's tau for censored data is slow, likely due in part to the nested for loops. The code should be examined to see whether it can be sped up (by vectorization or other approaches).
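For the uncensored part of the calculation, the nested i<j loops can be replaced with pairwise sign matrices; a sketch (function name illustrative, and a censored generalization would additionally need to mask pairs by their limit flags):

```python
import numpy as np

def tau_numerator(x, y):
    """Concordant-minus-discordant pair count, vectorized (no ties/censoring)."""
    x, y = np.asarray(x), np.asarray(y)
    sx = np.sign(x[:, None] - x[None, :])  # pairwise sign(x_i - x_j)
    sy = np.sign(y[:, None] - y[None, :])
    iu = np.triu_indices(len(x), k=1)      # count each i<j pair once
    return int(np.sum(sx[iu] * sy[iu]))
```

This trades Python-level loop overhead for O(n^2) memory in NumPy, which is usually a large speedup at the sample sizes involved.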
Hey, just a minor misspelling on the README both here and on the pypi page that could give some trouble in the future, so I thought I could give you a heads up.
On the example to compute the Pearson's r for a sample, using 1000 bootstrapping iterations, you have the formula:
res = pymccorrelation(data['x'], data['y]',
coeff='pearsonr',
Nboot=1000)
The data for the y variable is typed as data['y]', with one of the quotes on the wrong side of the bracket; it should be data['y']. If copied and pasted directly, this simple error can cause some trouble until it is fixed.
That's all, thanks for the package, it really helped me.
As of now, the library doesn't allow non-symmetric uncertainties, or values at a boundary (for which the uncertainty is one-sided).
These cases are extremely common in, e.g., astrophysics fitting results. I've done it myself locally and it's a very minor change. I think a proper implementation, up to the standards of such an important library, would be very beneficial to the community.
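One lightweight way to handle asymmetric error bars during the perturbation step is a split-normal draw: sample a standard normal deviate and scale it by the upper or lower uncertainty depending on its sign. A sketch under that assumption (the parameter names `dy_lo`/`dy_hi` are illustrative, not the library's API):

```python
import numpy as np

def perturb_asymmetric(y, dy_lo, dy_hi, rng=None):
    """Perturb y with split-normal (asymmetric) 1-sigma errors."""
    rng = np.random.default_rng() if rng is None else rng
    y, dy_lo, dy_hi = (np.asarray(a, float) for a in (y, dy_lo, dy_hi))
    z = rng.standard_normal(len(y))
    sigma = np.where(z >= 0, dy_hi, dy_lo)  # upper error above, lower below
    return y + z * sigma
```

One-sided (boundary) values could be handled the same way by setting the uncertainty on the forbidden side to zero.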
Instead of separate pymcspearman() and pymckendalltau() functions, provide a single pymccorrelation() function where the user can specify the correlation coefficient to calculate.
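A minimal sketch of how such a single entry point could dispatch (only a Pearson backend is shown; the function name and dict layout are illustrative, not the final API):

```python
import numpy as np

def _pearsonr(x, y):
    """Pearson r via the normalized covariance matrix."""
    return np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1]

def pymccorrelation_sketch(x, y, coeff='pearsonr'):
    """Dispatch on the requested coefficient; bootstrap/perturbation
    handling would wrap whichever backend is selected."""
    funcs = {'pearsonr': _pearsonr}  # 'spearmanr', 'kendallt' would join here
    if coeff not in funcs:
        raise ValueError(f"unknown correlation coefficient: {coeff!r}")
    return funcs[coeff](x, y)
```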
Need to update the code to do multiple iterations if multiple point perturbations are requested. Also, re-cast the argument as the number of perturbations per bootstrap.
Apply the Monte Carlo (bootstrapping and perturbation) methods to statistical tests comparing data with empirical or parametric probability distributions.
ALGORITHM AS R94 APPL. STATIST. (1995) VOL. 44, NO. 4.
Implement the Pearson correlation coefficient, after finishing the single-interface change (#2).
If the sample size is small (or the number of bootstraps is large), the correlation coefficients can be undefined and return nan values. The use of np.percentile() then returns nan from pymccorrelation(). If there are many nan values, this probably suggests the bootstrapping is not well converged. When looking at the mock dataset to check recovery (#4), the convergence of the bootstrapping would be good to consider.
Ultimately, decide whether nanpercentile() should be used, optionally with a warning if the size of the dataset is too small for reliable bootstrap error estimation.
There is probably statistics literature about this too...
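A sketch of the nanpercentile option with the suggested warning (the function name and the 1-sigma percentile choices are assumptions):

```python
import warnings
import numpy as np

def bootstrap_interval(coeffs, q=(15.87, 50.0, 84.13)):
    """Summarize bootstrap coefficients while tolerating NaN iterations."""
    coeffs = np.asarray(coeffs, float)
    nan_frac = np.isnan(coeffs).mean()
    if nan_frac > 0:
        warnings.warn(f"{nan_frac:.1%} of bootstrap iterations returned NaN; "
                      "the dataset may be too small for reliable error estimation.")
    return np.nanpercentile(coeffs, q)
```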
In functions such as pymcspearman(), having bootstrap=True and Nboot=10000 is redundant. Eliminate the flag argument and set Nboot=None by default. Non-None values will enable bootstrapping or perturbation.
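A sketch of the proposed signature (the rank-based Spearman backend here ignores ties and is only for illustration):

```python
import numpy as np

def _spearman_rho(x, y):
    """Spearman rho as the Pearson r of the ranks (no tie correction)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

def pymcspearman_sketch(x, y, Nboot=None, rng=None):
    """Nboot=None (the proposed default) skips bootstrapping entirely."""
    x, y = np.asarray(x), np.asarray(y)
    if Nboot is None:
        return _spearman_rho(x, y)
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.integers(0, len(x), size=(Nboot, len(x)))
    return np.array([_spearman_rho(x[i], y[i]) for i in idx])
```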
Since v1.17, numpy has a new procedure for generating pseudorandom sequences. Migrate the code to this:
https://numpy.org/doc/stable/reference/random/generator.html#numpy.random.default_rng
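The migration amounts to replacing module-level np.random.* calls with a Generator instance; for example:

```python
import numpy as np

# Legacy: np.random.seed(42); np.random.normal(...); np.random.randint(...)
# New Generator API (NumPy >= 1.17):
rng = np.random.default_rng(42)
draws = rng.normal(size=5)           # e.g. perturbation draws
idx = rng.integers(0, 10, size=10)   # e.g. bootstrap resampling indices
```

Passing the Generator (or a seed) down through the public functions would also make runs reproducible.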
From Ilsang:
I am really curious whether the output correlation coefficient recovers the intrinsic correlation coefficient within a certain confidence interval. The correlation coefficient is defined as cov(X,Y)/sqrt(var(X)*var(Y)), so you can easily generate random data points with a given covariance matrix and know what the input correlation coefficient is. If you simulate measuring the correlation coefficient of such datasets (using your program) with different sample sizes and measurement errors, how reliably can you recover the input correlation coefficient?
https://scipy-cookbook.readthedocs.io/items/CorrelatedRandomSamples.html
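Following the cookbook recipe, a recovery test can generate pairs with a known population correlation and check how well it is re-measured (a sketch; the sample size and seed are arbitrary choices):

```python
import numpy as np

def correlated_pairs(rho, n, rng=None):
    """Draw (x, y) from a bivariate normal with unit variances,
    so the population Pearson correlation is exactly rho."""
    rng = np.random.default_rng() if rng is None else rng
    cov = [[1.0, rho], [rho, 1.0]]
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    return x, y

x, y = correlated_pairs(0.8, 100_000, rng=np.random.default_rng(1))
r_hat = np.corrcoef(x, y)[0, 1]  # should land close to 0.8
```

Repeating this over a grid of sample sizes and added measurement errors would answer the recovery question quantitatively.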
For computing Kendall's tau with censored data, the bootstrapping code does not also resample the limit flags when resampling the data points.
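The fix is to draw the bootstrap indices once and apply them to the data and the flags together, so each resampled point keeps its own censoring flag; a sketch (function and parameter names illustrative):

```python
import numpy as np

def resample_with_limits(x, y, xlim, ylim, rng=None):
    """Jointly resample data points and their censoring flags."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.integers(0, len(x), size=len(x))  # one index draw for everything
    return tuple(np.asarray(a)[idx] for a in (x, y, xlim, ylim))
```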
If the number of bootstraps requested is larger than the number of possible permutations, duplicate coefficient/p-value results will be returned. This will lead to over-representation of that value in the probability distribution.
This check can be done at the start, when data lengths are checked for consistency.
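For a sample of n points there are C(2n-1, n) distinct bootstrap resamples (multisets of size n drawn from n items), so the check could look like this sketch (the function name and warning wording are assumptions):

```python
import warnings
from math import comb

def check_nboot(n, Nboot):
    """Warn when Nboot exceeds the number of distinct bootstrap resamples."""
    distinct = comb(2 * n - 1, n)
    if Nboot > distinct:
        warnings.warn(f"Nboot={Nboot} exceeds the {distinct} distinct "
                      f"resamples possible for n={n}; duplicates are inevitable.")
```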