rishi-kulkarni / hierarch

Resampling-Based Hypothesis Testing for Python

License: MIT License

Python 100.00%

hypothesis-tests resampling-strategies bootstrapping-statistics permutation-statistics

hierarch's Introduction

hierarch

A Hierarchical Resampling Package for Python

Version 1.1.6

hierarch is a package for hierarchical resampling (bootstrapping, permutation) of datasets in Python. Because for loops are ultimately intrinsic to cluster-aware resampling, hierarch uses Numba to accelerate many of its key functions.

hierarch has several functions to assist in performing resampling-based (and therefore distribution-free) hypothesis tests, confidence interval calculations, and power analyses on hierarchical data.

Introduction
Setup
Documentation
Citation

Introduction

Design-based randomization tests represent the platinum standard for significance analyses [1, 2, 3]. That is, they produce probability statements that depend only on the experimental design, not on less-than-verifiable assumptions about the probability distributions of the data-generating process. Researchers can use hierarch to quickly perform automated design-based randomization tests for experiments with arbitrary levels of hierarchy.

[1] Tukey, J.W. (1993). Tightening the Clinical Trial. Controlled Clinical Trials, 14(4), 266-285.

[2] Millard, S.P., Krause, A. (2001). Applied Statistics in the Pharmaceutical Industry. Springer.

[3] Berger, V.W. (2000). Pros and cons of permutation tests in clinical trials. Statistics in Medicine, 19(10), 1319-1328.

Setup

Dependencies

numpy
pandas (for importing data)
numba
scipy (for power analysis)

Installation

The easiest way to install hierarch is via PyPI.

pip install hierarch

Alternatively, you can install from Anaconda.

conda install -c rkulk111 hierarch

Documentation

Check out our user guide at readthedocs.

Citation

If hierarch helps you analyze your data, please consider citing it. The manuscript also contains a set of simulations validating hierarchical randomization tests in a variety of conditions.

Kulkarni RU, Wang CL, Bertozzi CR (2022) Analyzing nested experimental designs—A user-friendly resampling method to determine experimental significance. PLoS Comput Biol 18(5): e1010061. https://doi.org/10.1371/journal.pcbi.1010061

hierarch's Issues

Update citation in readme

The Numba PRNG needs to be seeded at the same time as the NumPy PRNG

Right now, the PRNGs need to be seeded separately, which is inconvenient.

Things to watch out for:

NumPy generators use PCG64 by default, while Numba uses the Mersenne Twister (I think this is still true?).

Allow to only receive resampling indices (make hierarch agnostic to specific statistical tests).

I am currently working with data from the Human Connectome Project. The dataset contains a lot of siblings, including dizygotic and monozygotic twins. For basically all my statistical tests, I would have to consider this nested structure when computing p-values (which I would like to derive from permutation tests) or confidence intervals (which I would like to obtain from bootstrapping). This is what the demographic data look like:

The problem is that I would like to conduct multivariate tests (CCA, to be more specific). This means that besides the demographic information that could be used to conduct a hierarchical resampling procedure, I also have matrices X and y that contain the "actual data" and that I would use to perform my statistical test. Looking at the Hypothesis Testing section, hierarch currently seems to be restricted to simple univariate tests like t-tests and ANOVA. Do you think it would be feasible to allow users to obtain only the resampled indices so they could apply their own tests?

This is my current approach, which is quite tedious:
1.) Use FSL-Palm to derive exchangeability blocks using the hcp2blocks.m function
2.) Use a Python implementation of FSL-Palm's quicker function (skpalm.permutations.quickperms) to derive a matrix of resampled indices.
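To make the request concrete, here is a minimal sketch of what "only the indices" could mean for a permutation test. It is not hierarch's API or FSL-Palm's algorithm. The function name within_block_permutation is hypothetical, and the sketch assumes the simplest scheme, in which observations are exchangeable only within their own block (for example, within a family).

```python
import numpy as np


def within_block_permutation(blocks, rng):
    """Return row indices shuffled only within each exchangeability block."""
    blocks = np.asarray(blocks)
    perm = np.arange(blocks.size)
    for label in np.unique(blocks):
        rows = np.flatnonzero(blocks == label)
        # Rows of this block trade places with each other, never with other blocks.
        perm[rows] = rng.permutation(rows)
    return perm


family = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3])
rng = np.random.default_rng(0)
perm = within_block_permutation(family, rng)
# A user-supplied test statistic would then be computed on X and y[perm].
```

Because the function returns indices rather than a test result, any statistic (CCA included) can be recomputed on the permuted rows by the caller.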

Implement helper function to create design matrix?

Hi! As an add-on to #127, it would be nice not only to obtain hierarchically resampled indices (so that users do not depend on one of the pre-defined tests provided by hierarch) but also to have a helper function that creates the design matrix needed as input to generate the indices in the first place. For example, right now I am using

[...] FSL-Palm to derive exchangeability blocks using the hcp2blocks.m function

This function takes over the tedious work of manually defining the EB matrix yourself. For example, consider the following dataframe. Some subjects in the HCP sample are completely unrelated to anyone (Families 2 and 5 only appear once). Some subjects are related (having the same family ID) but are regular siblings. Other subjects are also related and, on top of that, can be monozygotic or dizygotic twins.

Here it gets a bit philosophical, because MZ twins can be considered clones, whereas for DZ twins the hcp2blocks.m function lets the user decide whether they should be treated as regular siblings or as clones.

In my case, however, I also have repeated measures for each subject (which hcp2blocks.m cannot interpret). I fitted a mixed model (value ~ 1 + (1 | Subject)) and would like to obtain hierarchically resampled indices that respect the hierarchical structure of the HCP dataset. Do you think it would make sense to implement such a function?

Here's the code to regenerate the dataframe:

import pandas as pd

# Eight subjects, three repeated measures (conditions 1-3) each.
df = pd.DataFrame({
    'Subject': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8],
    'Family_ID': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 1, 1, 1, 4, 4, 4, 5, 5, 5, 3, 3, 3],
    'ZygosityGT': ['MZ', 'MZ', 'MZ', 'NoTwin', 'NoTwin', 'NoTwin', 'NoTwin', 'NoTwin', 'NoTwin',
                   'DZ', 'DZ', 'DZ', 'MZ', 'MZ', 'MZ', 'DZ', 'DZ', 'DZ',
                   'NoTwin', 'NoTwin', 'NoTwin', 'NoTwin', 'NoTwin', 'NoTwin'],
    'condition': [1, 2, 3] * 8,
    'value': [0.7739560485559633, 0.4388784397520523, 0.8585979199113825,
              0.6973680290593639, 0.09417734788764953, 0.9756223516367559,
              0.761139701990353, 0.7860643052769538, 0.12811363267554587,
              0.45038593789556713, 0.37079802423258124, 0.9267649888486018,
              0.6438651200806645, 0.82276161327083, 0.44341419882733113,
              0.2272387217847769, 0.5545847870158348, 0.06381725610417532,
              0.8276311719925821, 0.6316643991220648, 0.7580877400853738,
              0.35452596812986836, 0.9706980243949033, 0.8931211213221977],
})
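A very rough starting point for such a helper might look like the sketch below. The name build_hierarchy_matrix is hypothetical. The sketch only integer-codes each level of the hierarchy (outermost first) and stacks the codes as columns. It does not reproduce the output format of hcp2blocks.m and it ignores zygosity entirely. The family and subject arrays mirror the Family_ID and Subject columns of the dataframe above.

```python
import numpy as np


def build_hierarchy_matrix(*levels):
    """Integer-code each level (outermost first) and stack the codes as columns."""
    columns = []
    for labels in levels:
        # return_inverse gives, for every row, the position of its label
        # in the sorted array of unique labels.
        _, codes = np.unique(np.asarray(labels), return_inverse=True)
        columns.append(codes)
    return np.column_stack(columns)


# Same layout as the dataframe: 8 subjects x 3 repeated measures.
family = np.repeat([1, 2, 3, 4, 1, 4, 5, 3], 3)
subject = np.repeat(np.arange(1, 9), 3)

blocks = build_hierarchy_matrix(family, subject)
# Column 0 identifies the family, column 1 the subject nested within it.
```

Rows that share a value in column 0 belong to the same family, so a resampler could draw families first and subjects second, keeping each subject's repeated measures together.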

Create bootstrapped datasets for regression problems?

Hi @rishi-kulkarni, I would like to use your package to create a list of bootstrapped datasets (again referring to HCP data), but I noticed that hierarch.resampling.Bootstrapper.fit() expects a value for y to define a treatment and a control group. However, the HCP dataset does not have treatment and control groups (in other words, all my analyses are regression problems). Is it still possible to generate bootstrapped datasets using your functions even if there are no groups?

Reminder: I would like to generate n bootstrapped datasets from the HCP dataset. In this dataset, subjects can belong to the same family or even be twins. I need a function that respects this structure so that resampled datasets are similar in that regard.
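For illustration, a cluster bootstrap for a regression setting could be sketched with plain NumPy as follows. This is not hierarch's Bootstrapper. The generator name bootstrap_datasets is hypothetical, and the sketch assumes a single level of clustering (family), with every member of a drawn family kept together.

```python
import numpy as np


def bootstrap_datasets(X, y, groups, n_datasets, seed=None):
    """Yield (X_boot, y_boot, idx), resampling whole groups with replacement."""
    rng = np.random.default_rng(seed)
    X, y, groups = np.asarray(X), np.asarray(y), np.asarray(groups)
    labels = np.unique(groups)
    rows_by_label = {g: np.flatnonzero(groups == g) for g in labels}
    for _ in range(n_datasets):
        # Draw as many families as the original data contain, with replacement.
        drawn = rng.choice(labels, size=labels.size, replace=True)
        idx = np.concatenate([rows_by_label[g] for g in drawn])
        yield X[idx], y[idx], idx


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
y = rng.normal(size=6)
family = np.array([1, 1, 2, 3, 3, 3])

datasets = list(bootstrap_datasets(X, y, family, n_datasets=5, seed=1))
```

Each resampled dataset can have a different number of rows when families differ in size, which is expected for this kind of resampling.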

Switch build system to `poetry`

This will make it easier to specify which versions of NumPy Numba is compatible with.