pawel-czyz / labelshift

Python library for label shift and quantification.

License: BSD 3-Clause "New" or "Revised" License

Makefile 0.11% Python 99.89%
data-shift prior-probability-shift label-shift quantification machine-learning

labelshift's Introduction

Label Shift

Python library for quantification (estimating the class prevalence in an unlabeled data set) under the prior probability shift assumption.
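
For reference, the prior probability shift assumption is commonly written as below. The notation is an assumption made here for illustration and is not taken from the library: $X$ denotes the features, $Y$ the label, and $C = f(X)$ the output of an arbitrary black-box classifier $f$, matching the $P(C\mid Y)$ matrix mentioned in the issues.

$$
P_{\text{labeled}}(X \mid Y) = P_{\text{unlabeled}}(X \mid Y),
\qquad
P_{\text{labeled}}(Y) \neq P_{\text{unlabeled}}(Y) \ \text{in general}.
$$

Because $C$ is a function of $X$ alone, the same invariance holds for $P(C \mid Y)$, so the classifier outputs on the unlabeled data satisfy

$$
P_{\text{unlabeled}}(C = c) = \sum_{y} P(C = c \mid Y = y)\, P_{\text{unlabeled}}(Y = y).
$$

Quantification then amounts to recovering the prevalence vector $P_{\text{unlabeled}}(Y)$ from this relation.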

This module was created with two purposes in mind:

  • easily apply state-of-the-art quantification algorithms to real problems,
  • benchmark novel quantification algorithms against others.

It is compatible with any classifier using any machine learning framework.

The code inside was used to run the experiments in our preprint, which can be cited as:

@misc{https://doi.org/10.48550/arxiv.2302.09159,
  doi = {10.48550/ARXIV.2302.09159},
  url = {https://arxiv.org/abs/2302.09159},
  author = {Ziegler, Albert and Czyż, Paweł},
  title = {Bayesian Quantification with Black-Box Estimators},
  publisher = {arXiv},
  year = {2023}
}

Installation

The module is currently at an early stage of development and is not ready to be installed. It does not have proper documentation yet either. We hope to change this soon. Thank you for your patience!

Contributions

Contributions are very welcome! Please check our contribution guide.

labelshift's People

Contributors

pawel-czyz

labelshift's Issues

Credible intervals and misspecification

Create the following experiment:

  1. Sample $w \sim \mathrm{Uniform}(0.05, 0.95)$ many times, e.g., $S=200$.
  2. For each weight $w$, construct a data set with $N' = N = 1000$ from a mixture of two Student distributions with $\nu = 3$ degrees of freedom: one located at $\mu_1 = 0$ and the other at $\mu_2 = \delta$, where $\delta \sim \mathrm{Uniform}(0.5, 3)$. The dispersion should be of order 0.5.
  3. Fit using HMC the following models:
  • Well-specified Student mixture.
  • Misspecified Gaussian mixture.
  • Partition the real axis into $K$ bins for different $K$ and fit the usual model.
  4. Based on the posterior samples, calculate HDI credible intervals at varying nominal coverage levels.
  5. Compare the nominal credible interval coverage with the empirical frequentist coverage. We expect the well-specified model to have approximately correct coverage, the misspecified Gaussian mixture to have too small coverage, and the binned model to have too large coverage (that is, to be somewhat more conservative than a correctly specified model).
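
The data-generating and coverage steps above can be sketched with NumPy as follows. This is only an illustration under stated assumptions, not code from the repository: the function names `sample_mixture`, `hdi` and `empirical_coverage` are hypothetical, $w$ is assumed to be the weight of the component located at $\mu_1 = 0$, and the HMC fits that would produce the posterior samples are not shown.

```python
import numpy as np


def sample_mixture(rng, w, delta, n, nu=3.0, scale=0.5):
    """Draw n points from w * t_nu(0, scale) + (1 - w) * t_nu(delta, scale).

    Assumption: w weights the component located at 0.
    """
    # Latent component labels: 0 with probability w, 1 otherwise.
    labels = (rng.random(n) >= w).astype(int)
    locations = np.where(labels == 0, 0.0, delta)
    # A location-scale Student variable is loc + scale * standard_t(nu).
    x = locations + scale * rng.standard_t(nu, size=n)
    return x, labels


def hdi(samples, mass):
    """Narrowest interval containing a fraction `mass` of the samples."""
    s = np.sort(np.asarray(samples, dtype=float))
    n = s.size
    k = min(max(int(np.ceil(mass * n)), 1), n)  # samples inside the interval
    widths = s[k - 1:] - s[: n - k + 1]  # widths of all windows of k samples
    i = int(np.argmin(widths))
    return s[i], s[i + k - 1]


def empirical_coverage(true_values, intervals):
    """Fraction of (lower, upper) intervals containing the matching true value."""
    true_values = np.asarray(true_values, dtype=float)
    intervals = np.asarray(intervals, dtype=float)
    inside = (intervals[:, 0] <= true_values) & (true_values <= intervals[:, 1])
    return float(inside.mean())


# Steps 1 and 2: draw S weights and separations, then one data set for each.
rng = np.random.default_rng(0)
S, N = 200, 1000
ws = rng.uniform(0.05, 0.95, size=S)
deltas = rng.uniform(0.5, 3.0, size=S)
datasets = [sample_mixture(rng, w, d, N) for w, d in zip(ws, deltas)]
```

For each fitted model, one would call `hdi` on the posterior samples of $w$ at several values of `mass`, then pass the true weights `ws` and the resulting intervals to `empirical_coverage` to obtain one point of the coverage curve per nominal level.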

For the figure:

  1. Plot some data distribution.
  2. Plot the posteriors (two panels).
  3. Plot the coverage.

Refactor categorical classifier benchmark into Snakemake

The six-panel plot should be refactored into a single Snakemake workflow.

Requirements

  • Use mean $\pm$ standard deviation, rather than a boxplot.
  • Consistent colors of different estimators.
  • Use posterior mean, BBSE and two kinds (restricted and unrestricted) of IR.

Single-cell data analysis

Take some single-cell data.

Split samples into:

  1. Training.
  2. Validation.
  3. Test.

Train a random forest on training, evaluate on validation, evaluate on test.

The plot:

  1. PCA of normalized data coloured by cell types.
  2. PCA of normalized data coloured by samples.
  3. Estimate cell type prevalence. Compare with measured frequencies.

Comparison with bootstrap

Create the following figures:

Nearly nonidentifiable model and four sample visualisations: posterior samples, BBSE + bootstrap, and both kinds of IR + bootstrap. Use $N=500$ or something like that. Do it for 5 seeds. The main part will show one seed; the rest will go into the appendix.

A similar experiment, but with a quite identifiable model and small $N$, so that there is quite a lot of uncertainty in estimating the $P(C\mid Y)$ matrix.
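
As a rough illustration of the "BBSE + bootstrap" baseline, the sketch below computes a BBSE point estimate from hard classifier predictions and wraps it in a nonparametric bootstrap. It is not the repository's implementation: the function names, the choice to resample the labeled and unlabeled sets independently, and the percentile interval are all assumptions made for this example.

```python
import numpy as np


def bbse(c_labeled, y_labeled, c_unlabeled, n_classes):
    """BBSE point estimate of the unlabeled prevalence from hard predictions."""
    # joint[c, y] estimates P_labeled(C = c, Y = y) by counting.
    joint = np.zeros((n_classes, n_classes))
    np.add.at(joint, (c_labeled, y_labeled), 1.0)
    joint /= len(y_labeled)
    p_y_labeled = joint.sum(axis=0)
    # q[c] estimates P_unlabeled(C = c).
    q = np.bincount(c_unlabeled, minlength=n_classes) / len(c_unlabeled)
    # Solve joint @ weights = q, where weights[y] = P_unlabeled(y) / P_labeled(y).
    weights = np.linalg.lstsq(joint, q, rcond=None)[0]
    # No projection onto the simplex: entries may fall outside [0, 1].
    return weights * p_y_labeled


def bootstrap_bbse(c_labeled, y_labeled, c_unlabeled, n_classes,
                   n_boot=500, level=0.9, seed=0):
    """Percentile bootstrap interval for each entry of the BBSE estimate."""
    c_labeled = np.asarray(c_labeled)
    y_labeled = np.asarray(y_labeled)
    c_unlabeled = np.asarray(c_unlabeled)
    rng = np.random.default_rng(seed)
    n, m = len(y_labeled), len(c_unlabeled)
    estimates = np.empty((n_boot, n_classes))
    for b in range(n_boot):
        # Assumption: both data sets are resampled with replacement.
        i = rng.integers(0, n, size=n)
        j = rng.integers(0, m, size=m)
        estimates[b] = bbse(c_labeled[i], y_labeled[i], c_unlabeled[j], n_classes)
    alpha = (1.0 - level) / 2.0
    lower, upper = np.quantile(estimates, [alpha, 1.0 - alpha], axis=0)
    return lower, upper
```

In a nearly nonidentifiable setting the estimated joint matrix is close to singular, so individual bootstrap replicates can be very unstable; the least-squares solve is used here only so that the sketch does not fail on a singular resample.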