
pawel-czyz / labelshift

Python library for label shift and quantification.

License: BSD 3-Clause "New" or "Revised" License

Makefile 0.11% Python 99.89%

data-shift prior-probability-shift label-shift quantification machine-learning

labelshift's Introduction

Label Shift

Python library for quantification (estimating the class prevalence in an unlabeled data set) under the prior probability shift assumption.

This module was created with two purposes in mind:

to easily apply state-of-the-art quantification algorithms to real problems,
to benchmark novel quantification algorithms against existing ones.

It is compatible with any classifier using any machine learning framework.

The code inside was used to run the experiments in our preprint, which can be cited as:

@misc{https://doi.org/10.48550/arxiv.2302.09159,
  doi = {10.48550/ARXIV.2302.09159},
  url = {https://arxiv.org/abs/2302.09159},
  author = {Ziegler, Albert and Czyż, Paweł},
  title = {Bayesian Quantification with Black-Box Estimators},
  publisher = {arXiv},
  year = {2023}
}

Installation

The module is currently at an early stage of development and is not yet ready to be installed. It does not have proper documentation either. We hope to change this soon. Thank you for your patience!

Contributions

Contributions are very welcome! Please check our contribution guide.

labelshift's Issues

Credible intervals and misspecification

Create the following experiment:

Sample $w \sim \mathrm{Uniform}(0.05, 0.95)$ many times, e.g., $S=200$.
For each weight $w$, construct a data set (with, e.g., $N' = N = 1000$) from a mixture of two Student distributions with $\nu = 3$ degrees of freedom: one located at $\mu_1 = 0$ and the other at $\mu_2 = \delta$, where $\delta \sim \mathrm{Uniform}(0.5, 3)$. The dispersion should be of order 0.5.
Fit the following models using HMC:

Well-specified Student mixture.
Misspecified Gaussian mixture.
Partition the real axis into $K$ bins for different $K$ and fit the usual model.

Based on the posterior samples, calculate HDI credible intervals at varying nominal coverage levels.
Compare the nominal coverage of the credible intervals with their frequentist coverage. We expect the well-specified model to have approximately correct coverage, the misspecified Gaussian mixture to undercover, and the binned model to overcover (that is, to be somewhat more conservative than a correctly specified model).
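The data-generation, binning, and coverage steps above can be sketched with NumPy as follows. This is only an illustrative sketch, not code from the library: the function names are hypothetical, $w$ is assumed to be the weight of the component located at 0, both components are assumed to share $\nu = 3$ and the same scale, and the interior bin edges are left for the caller to choose.

```python
import numpy as np


def simulate_mixture(rng, w, delta, n, nu=3.0, scale=0.5):
    """Draw n points from w * t_nu(0, scale) + (1 - w) * t_nu(delta, scale)."""
    from_first = rng.random(n) < w              # True -> component located at 0
    locs = np.where(from_first, 0.0, delta)
    x = locs + scale * rng.standard_t(nu, size=n)
    return x, from_first


def bin_counts(x, interior_edges):
    """Counts per bin for a partition of the real axis into K bins,
    defined by K - 1 increasing interior edges."""
    k = len(interior_edges) + 1
    return np.bincount(np.digitize(x, interior_edges), minlength=k)


def hdi(samples, mass):
    """Narrowest interval containing a fraction `mass` of the samples."""
    x = np.sort(np.asarray(samples, dtype=float))
    k = int(np.ceil(mass * len(x)))             # number of points inside
    widths = x[k - 1:] - x[: len(x) - k + 1]    # width of every k-point window
    i = int(np.argmin(widths))
    return x[i], x[i + k - 1]


def empirical_coverage(true_values, posterior_draws, mass):
    """Fraction of runs whose HDI at level `mass` contains the true value."""
    hits = []
    for truth, draws in zip(true_values, posterior_draws):
        lo, hi = hdi(draws, mass)
        hits.append(lo <= truth <= hi)
    return float(np.mean(hits))
```

Here `posterior_draws` would hold one array of posterior samples of $w$ per simulated data set (one per fitted model), and `empirical_coverage` would be evaluated on a grid of nominal levels to obtain the coverage curve.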

For the figure:

Plot some data distribution.
Plot the posteriors (two panels).
Plot the coverage.

Separate model building and sampling

Separate building the PyMC model from sampling, since we may want to try, e.g., a MAP estimate instead.

Refactor categorical classifier benchmark into Snakemake

The six-panel plot should be refactored into a single Snakemake workflow.

Requirements

Use mean $\pm$ standard deviation, rather than a boxplot.
Consistent colors of different estimators.
Use posterior mean, BBSE and two kinds (restricted and unrestricted) of IR.

Single-cell data analysis

Take some single-cell data.

Split samples into:

Training.
Validation.
Test.

Train a random forest on training, evaluate on validation, evaluate on test.

The plot:

PCA of normalized data coloured by cell types.
PCA of normalized data coloured by samples.
Estimate cell type prevalence. Compare with measured frequencies.

Comparison with bootstrap

Create the following figures:

A nearly nonidentifiable model and four sample visualisations: posterior samples, BBSE + bootstrap, and both kinds of IR + bootstrap. Use $N = 500$ or similar and 5 seeds. The main part will show one seed; the rest will go into the appendix.

A similar experiment, but with a well-identifiable model and small $N$, so that there is considerable uncertainty in estimating the $P(C\mid Y)$ matrix.
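The "BBSE + bootstrap" baseline could look roughly like the sketch below. It is a hedged illustration rather than the library's implementation: the function names are hypothetical, the point estimate solves $P(C \mid Y)^\top \pi' = q(C)$ by least squares, and the nonparametric bootstrap is assumed to resample the labeled pairs and the unlabeled predictions independently, with replacement.

```python
import numpy as np


def bbse_point(y_lab, c_lab, c_unl, n_classes):
    """Solve P(C|Y)^T pi' = q(C) for the unlabeled prevalence pi'."""
    counts = np.zeros((n_classes, n_classes))
    np.add.at(counts, (y_lab, c_lab), 1.0)      # counts[y, c]
    row_sums = np.clip(counts.sum(axis=1, keepdims=True), 1.0, None)
    p_c_given_y = counts / row_sums             # rows estimate P(C | Y = y)
    q = np.bincount(c_unl, minlength=n_classes) / len(c_unl)
    # Least squares also returns an answer when the matrix is nearly singular.
    pi, *_ = np.linalg.lstsq(p_c_given_y.T, q, rcond=None)
    return pi                                   # may leave the simplex


def bootstrap_bbse(rng, y_lab, c_lab, c_unl, n_classes, n_boot=200):
    """Bootstrap replicates of the point estimate, shape (n_boot, n_classes)."""
    y_lab, c_lab, c_unl = map(np.asarray, (y_lab, c_lab, c_unl))
    draws = np.empty((n_boot, n_classes))
    for b in range(n_boot):
        i = rng.integers(0, len(y_lab), size=len(y_lab))   # labeled pairs
        j = rng.integers(0, len(c_unl), size=len(c_unl))   # unlabeled preds
        draws[b] = bbse_point(y_lab[i], c_lab[i], c_unl[j], n_classes)
    return draws
```

The spread of the rows of `draws` can then be plotted next to the posterior samples. In the nearly nonidentifiable setting the replicates are expected to be unstable and may fall outside the simplex, which is part of what the comparison would show.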

Implement Lipton et al. (2018) estimator

This estimator is described in this paper
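As a starting point, black-box shift estimation (BBSE) can be sketched with hard predictions as below. The function names are hypothetical and not part of the library. The sketch estimates the joint confusion matrix $C_{ij} = P(f(x) = i, y = j)$ on labeled source data and the predicted-label distribution $\mu$ on unlabeled target data, then solves $C w = \mu$ for the importance weights $w_j = q(y = j) / p(y = j)$. Clipping negative weights and renormalizing the resulting prevalence are assumptions of this sketch.

```python
import numpy as np


def bbse_weights(y_src, pred_src, pred_tgt, n_classes):
    """Importance weights w[j] = q(y = j) / p(y = j) obtained from C w = mu."""
    y_src, pred_src, pred_tgt = map(np.asarray, (y_src, pred_src, pred_tgt))
    # Joint confusion matrix on source data: joint[i, j] = P(f(x) = i, y = j).
    joint = np.zeros((n_classes, n_classes))
    np.add.at(joint, (pred_src, y_src), 1.0 / len(y_src))
    # Distribution of predicted labels on the unlabeled target data.
    mu_tgt = np.bincount(pred_tgt, minlength=n_classes) / len(pred_tgt)
    # Raises numpy.linalg.LinAlgError if the confusion matrix is singular.
    return np.linalg.solve(joint, mu_tgt)


def bbse_prevalence(y_src, pred_src, pred_tgt, n_classes):
    """Target class prevalence q(y) = w * p(y), clipped and renormalized."""
    w = bbse_weights(y_src, pred_src, pred_tgt, n_classes)
    p_src = np.bincount(np.asarray(y_src), minlength=n_classes) / len(y_src)
    prev = np.clip(w, 0.0, None) * p_src
    return prev / prev.sum()
```

For example, with a balanced source set, a classifier that is right 80% of the time on class 0 and 90% of the time on class 1, and target predictions split 24% / 76%, the weights come out as (0.4, 1.6) and the estimated target prevalence as (0.2, 0.8).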
