coaxlab / pycombat

Python implementation of Combat for data harmonisation, which also allows unwanted effects to be removed

License: MIT License

pycombat's Introduction

pycombat

Python version of the data harmonisation technique Combat. In addition to batch effects, this package also allows covariate effects to be removed from the data.

Combat is a technique for data harmonisation based on a linear mixed model in which location and scale random effects across batches are adjusted using an empirical Bayes approach (Johnson, 2007):
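In the usual notation of Johnson et al. (2007), the model can be written as below. The indexing is an assumption made here for illustration: i runs over batches, j over observations within a batch, and g over features; gamma is the additive (location) batch effect and delta the multiplicative (scale) batch effect.

```latex
Y_{ijg} = \alpha_g + X_{ij}\,\beta_g + \gamma_{ig} + \delta_{ig}\,\varepsilon_{ijg},
\qquad \varepsilon_{ijg} \sim \mathcal{N}(0, \sigma_g^2)
```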

The original Combat technique keeps the baseline effects alpha and the effects of interest beta by reintroducing them after harmonisation:
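A common way of writing this adjustment (for example in Fortin et al., 2017) is shown below. Hats denote estimates from the linear model and stars denote the empirical Bayes estimates of the batch effects; the indices i (batch), j (observation) and g (feature) are an assumed convention.

```latex
Y^{\text{ComBat}}_{ijg}
  = \frac{Y_{ijg} - \hat{\alpha}_g - X_{ij}\,\hat{\beta}_g - \gamma^{*}_{ig}}{\delta^{*}_{ig}}
  + \hat{\alpha}_g + X_{ij}\,\hat{\beta}_g
```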

One extension offered by this Python package is the possibility of removing the effect of unwanted variables by not reintroducing them after harmonisation. Using the same linear mixed model as at the beginning, we now separate the sources of covariation C from the sources of effects of interest X:
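Under the same assumed indexing (i for batches, j for observations, g for features), the split model can be sketched as follows, where the superscripts x and c are labels introduced here to tell the two sets of coefficients apart:

```latex
Y_{ijg} = \alpha_g + X_{ij}\,\beta^{x}_g + C_{ij}\,\beta^{c}_g
        + \gamma_{ig} + \delta_{ig}\,\varepsilon_{ijg}
```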

In this case, the Combat adjustment is given by:
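Written out under the assumed notation (hats for linear-model estimates, stars for empirical Bayes estimates, superscripts x and c for the coefficients of X and C), the covariate term is subtracted together with the batch effects but only the baseline and the effects of interest are added back:

```latex
Y^{\text{ComBat}}_{ijg}
  = \frac{Y_{ijg} - \hat{\alpha}_g - X_{ij}\,\hat{\beta}^{x}_g - C_{ij}\,\hat{\beta}^{c}_g - \gamma^{*}_{ig}}{\delta^{*}_{ig}}
  + \hat{\alpha}_g + X_{ij}\,\hat{\beta}^{x}_g
```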

This straightforward modification of Combat has recently been proposed by other authors (Wachinger, 2020).

References:

  • W. Evan Johnson, Cheng Li, Ariel Rabinovic, Adjusting batch effects in microarray expression data using empirical Bayes methods, Biostatistics, Volume 8, Issue 1, January 2007, Pages 118–127, https://doi.org/10.1093/biostatistics/kxj037

  • L. Dyrskjot, M. Kruhoffer, T. Thykjaer, N. Marcussen, J. L. Jensen, K. Moller, and T. F. Orntoft. Gene expression in the urinary bladder: a common carcinoma in situ gene expression signature exists disregarding histopathological classification. Cancer Res., 64:4040–4048, Jun 2004.

  • Christian Wachinger, Anna Rieckmann, Sebastian Pölsterl. Detect and Correct Bias in Multi-Site Neuroimaging Datasets. arXiv:2002.05049

  • Fortin, J. P., N. Cullen, Y. I. Sheline, W. D. Taylor, I. Aselcioglu, P. A. Cook, P. Adams, C. Cooper, M. Fava, P. J. McGrath, M. McInnis, M. L. Phillips, M. H. Trivedi, M. M. Weissman and R. T. Shinohara (2017). "Harmonization of cortical thickness measurements across scanners and sites." Neuroimage 167: 104-120.

Install

pip install pycombat

Usage

Following the spirit of scikit-learn, Combat is a class with a method called fit, which estimates the parameters of the linear mixed model, and a method called transform, which uses the previously learned parameters to adjust the data. There is also a method called fit_transform, which chains both steps.
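To make the fit / transform / fit_transform pattern concrete, here is a minimal sketch built around a hypothetical ToyBatchScaler class. It is not pycombat's implementation: it only standardises each batch by its own mean and standard deviation, and it skips the linear model terms (X, C) and the empirical Bayes step entirely. All names in it are assumptions made for illustration.

```python
import numpy as np


class ToyBatchScaler:
    """Hypothetical illustration of the fit/transform pattern; not pycombat's Combat."""

    def fit(self, Y, b):
        Y, b = np.asarray(Y, dtype=float), np.asarray(b)
        # Overall location and scale, restored after per-batch standardisation
        self.grand_mean_ = Y.mean(axis=0)
        self.grand_std_ = Y.std(axis=0)
        # Per-batch location and scale, learned from the data passed to fit
        self.batch_mean_ = {k: Y[b == k].mean(axis=0) for k in np.unique(b)}
        self.batch_std_ = {k: Y[b == k].std(axis=0) for k in np.unique(b)}
        return self

    def transform(self, Y, b):
        Y, b = np.asarray(Y, dtype=float), np.asarray(b)
        Y_adj = np.empty_like(Y)
        for k in np.unique(b):
            rows = b == k
            # Remove the stored batch location/scale, then restore the overall one
            z = (Y[rows] - self.batch_mean_[k]) / self.batch_std_[k]
            Y_adj[rows] = z * self.grand_std_ + self.grand_mean_
        return Y_adj

    def fit_transform(self, Y, b):
        # Chain both steps on the same data
        return self.fit(Y, b).transform(Y, b)


rng = np.random.default_rng(0)
b = np.repeat(["a", "b", "c"], 20)
Y = rng.normal(size=(60, 4))
Y[b == "b"] += 3.0  # additive shift in batch "b"
Y[b == "c"] *= 2.0  # multiplicative rescaling in batch "c"

scaler = ToyBatchScaler()
Y_adjusted = scaler.fit_transform(Y, b)
```

After the call, every batch in Y_adjusted shares the overall mean and standard deviation stored by fit, which is the simplest version of a location and scale adjustment.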

So, the first thing you need to do is to define an instance of this class:

from pycombat import Combat
combat = Combat()

When defining the combat instance, you can pass it the following parameters:

  • method: either "p" for parametric or "np" for non-parametric (the non-parametric option is not implemented yet)
  • conv: the criterion to decide when to stop the EB optimization step (default value = 0.0001)

Now, you have to call the method fit, passing it the data.

combat.fit(Y=Y, b=b, X=X, C=C)

These input data consist of the following ingredients:

  • Y: The matrix of response variables, with dimensions [observations x features]
  • b: The array of batch labels, one per observation. In principle, the labels can be numbers or strings.
  • X: The matrix of effects of interest to keep, with dimensions [observations x features_interest]
  • C: The matrix of covariates to remove, with dimensions [observations x features_covariates]
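The shapes above can be sketched with synthetic data. Everything in this example (sizes, site names, the choice of a binary group as effect of interest and age as covariate) is made up for illustration; the only point is that all four inputs share the same number of rows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_features = 12, 5

# Y: response variables, [observations x features]
Y = rng.normal(size=(n_obs, n_features))

# b: one batch label per observation (strings here, numbers would also do)
b = np.array(["site_a"] * 6 + ["site_b"] * 6)

# X: effects of interest to keep, [observations x features_interest]
X = rng.integers(0, 2, size=(n_obs, 1)).astype(float)

# C: covariates to remove, [observations x features_covariates]
C = rng.normal(loc=50.0, scale=10.0, size=(n_obs, 1))

# All inputs must be aligned row by row
n_rows = {Y.shape[0], b.shape[0], X.shape[0], C.shape[0]}
```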

Important: If your effects of interest or covariates involve categorical features, make sure that you drop the first level of each category when building the design matrices; otherwise, together with the intercept, they will be singular. You can easily accomplish this with pandas by calling pd.get_dummies with drop_first=True.
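As a minimal sketch of this step, the example below builds X and C from a small made-up table (the column names and values are assumptions for illustration) and checks the rank of the design matrix with and without dropping the first level.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "diagnosis": ["control", "patient", "patient", "control", "patient", "control"],
    "sex": ["F", "M", "F", "M", "M", "F"],
    "age": [34.0, 51.0, 47.0, 29.0, 62.0, 40.0],
})

# Effect of interest: diagnosis, with the first level ("control") dropped
X = pd.get_dummies(df[["diagnosis"]], drop_first=True, dtype=float).to_numpy()

# Covariates to remove: sex (first level dropped) and age
sex_dummies = pd.get_dummies(df[["sex"]], drop_first=True, dtype=float)
C = np.column_stack([sex_dummies.to_numpy(), df["age"].to_numpy()])

# Why the first level must go: with an intercept, the full set of dummies is collinear
intercept = np.ones((len(df), 1))
full = pd.get_dummies(df["diagnosis"], dtype=float).to_numpy()
rank_full = np.linalg.matrix_rank(np.column_stack([intercept, full]))  # 3 columns
rank_dropped = np.linalg.matrix_rank(np.column_stack([intercept, X]))  # 2 columns
```

Here the full encoding has three columns but only rank 2, while the encoding with the first level dropped has full column rank.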

After fitting the data, you can adjust it by calling the transform method:

Y_adjusted = combat.transform(Y=Y, b=b, X=X, C=C)

Alternatively, you can combine both steps by just calling the method fit_transform:

Y_adjusted = combat.fit_transform(Y=Y, b=b, X=X, C=C)

pycombat's People

Contributors

h2co3, jrasero

pycombat's Issues

Change how batches are taken in transform function

Currently, the matrix of batches in the transform function is built following the same procedure as in fit, i.e. using get_dummies from pandas. This, however, will fail if the array of batches passed to the transform function differs from the one used in the fitting process. For example, in a training/test set partition, the test set may contain only a subset of the batches used for training.

The solution would be to determine which of the estimated batch coefficients each batch passed to the transform function corresponds to.
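One possible way to do this, sketched here under assumptions and not taken from pycombat's code, is to store the batch labels seen at fit time and encode the batches passed to transform against that fixed list with pd.Categorical. The helper name batch_design and the variable fitted_batches are hypothetical.

```python
import numpy as np
import pandas as pd

b_train = np.array(["site_a", "site_b", "site_c", "site_a", "site_c"])
b_test = np.array(["site_c", "site_a"])  # site_b is absent from the test split

# Labels that would be stored at fit time (hypothetical attribute)
fitted_batches = np.unique(b_train)


def batch_design(b, categories):
    """One-hot encode b with one column per fitted batch, in fitted order."""
    cat = pd.Categorical(b, categories=categories)
    # Labels outside `categories` become missing values in the Categorical
    if cat.isna().any():
        raise ValueError("b contains batches that were not seen during fit")
    # get_dummies keeps a column for every category, even unobserved ones
    return pd.get_dummies(cat, dtype=float).to_numpy()


B_test = batch_design(b_test, fitted_batches)
```

With this encoding, the columns of B_test line up with the rows of the fitted batch coefficients even when some batches are missing from the test split.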

Code

Hi,

First of all, thank you for the code! Could you please provide some example code showing how to use it? For instance, an example notebook that uses fit_transform on a training set and transform on a test set, joining 2 or 3 datasets with batch effects.

Thanks

Pablo

[Possible bug] `transform()` raises when number of batches is less than what `fit()` was called with

Hi there,

I am not sure if this is a bug, because I'm not familiar enough with the maths behind ComBat. In any case, I have a feeling this should work, so here it is.

I have a training pipeline in which random sampling occasionally results in a hold-out test set that does not represent all batches. When calling combat.transform(X_test) on the already-fit Combat instance, I get the following exception:

ValueError: Wrong number of categories for b

Obviously, if there were an additional, new batch at transform() time that was unknown at fit()ting time, that would be a legitimate error, because no state could have been recorded as to how that unknown batch must be corrected. However, I think asking for a transform that doesn't include all the batches previously encountered shouldn't be considered an error.

Accordingly, I propose the following change to pycombat.py:

< if len(np.unique(b)) != self.gamma_.shape[0]:
> if len(np.unique(b)) > self.gamma_.shape[0]:

Would this be possible, or is it a fundamental necessity of the ComBat algorithm that the number of batches be exactly the same when fitting and transforming? If it is something fundamental that cannot be fixed, then how do people deal with this kind of problem?
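One caveat with comparing counts alone is that a test set with the same number of batches, but different labels, would pass the check. A stricter variant, sketched below with a hypothetical check_batches helper (not part of pycombat), compares the labels themselves and rejects only those that were not seen at fit time.

```python
import numpy as np


def check_batches(b, fitted_batches):
    """Raise if b contains a label that is absent from fitted_batches."""
    unseen = np.setdiff1d(np.unique(b), fitted_batches)
    if unseen.size > 0:
        raise ValueError(f"Batches not seen during fit: {unseen.tolist()}")


fitted_batches = np.array(["site_a", "site_b", "site_c"])

# A subset of the fitted batches is accepted
check_batches(np.array(["site_a", "site_c"]), fitted_batches)

# Same number of batches as at fit time, but one label is new: rejected
try:
    check_batches(np.array(["site_a", "site_b", "site_d"]), fitted_batches)
    same_count_rejected = False
except ValueError:
    same_count_rejected = True
```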
