zhiningliu1998 / mesa


[NeurIPS’20] ⚖️ Build powerful ensemble class-imbalanced learning models via meta-knowledge-powered resampler.

Home Page: https://arxiv.org/abs/2010.08830

License: MIT License

Python 33.48% Jupyter Notebook 66.52%
imbalanced-learning imbalanced-data meta-learning-algorithms meta-sampler ensemble ensemble-model ensemble-machine-learning mesa meta-training class-imbalance

mesa's Introduction

MESA: Meta-sampler for imbalanced learning

MESA: Boost Ensemble Imbalanced Learning with MEta-SAmpler (NeurIPS 2020)

MESA is a meta-learning-based ensemble learning framework for solving class-imbalanced learning problems. It is a task-agnostic general-purpose solution that is able to boost most of the existing machine learning models' performance on imbalanced data.

Cite Us

If you find this repository helpful in your work or research, we would greatly appreciate citations to the following paper:

@inproceedings{liu2020mesa,
    title={MESA: Boost Ensemble Imbalanced Learning with MEta-SAmpler},
    author={Liu, Zhining and Wei, Pengfei and Jiang, Jing and Cao, Wei and Bian, Jiang and Chang, Yi},
    booktitle={Conference on Neural Information Processing Systems},
    year={2020},
}

Background

About MESA

We introduce a novel ensemble imbalanced learning (EIL) framework named MESA. It adaptively resamples the training set over several iterations to train multiple classifiers, which together form a cascade ensemble model. Rather than following random heuristics, MESA learns a parameterized sampling strategy (i.e., the meta-sampler) directly from data to optimize the final metric. It consists of three parts: meta-sampling and ensemble training, which build the ensemble classifiers, and meta-training, which optimizes the meta-sampler.

The figure below gives an overview of the MESA framework.

image
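The resampling step can be pictured with a small sketch. It assumes (this README does not spell it out) that the meta-sampler outputs a scalar mu in [0, 1], that each majority-class instance is weighted by a Gaussian of its classification error centred at mu with width sigma (the --sigma option), and that as many majority instances are drawn as there are minority instances. The function name and the without-replacement sampling trick are illustrative, not taken from the repository.

```python
import math
import random


def gaussian_undersample(maj_errors, n_minority, mu, sigma=0.2, seed=None):
    """Pick n_minority majority-class indices, favouring errors close to mu.

    Assumptions: errors lie in [0, 1], mu is the meta-sampler's output,
    and sampling is done without replacement.
    """
    rng = random.Random(seed)
    k = min(n_minority, len(maj_errors))
    keyed = []
    for i, err in enumerate(maj_errors):
        # Unnormalised Gaussian weight; floor it to avoid division by zero.
        w = max(math.exp(-0.5 * ((err - mu) / sigma) ** 2), 1e-300)
        # Efraimidis-Spirakis weighted sampling without replacement:
        # key = log(u) / w, and the k largest keys are kept.
        u = 1.0 - rng.random()  # uniform in (0, 1]
        keyed.append((math.log(u) / w, i))
    keyed.sort(reverse=True)
    return sorted(i for _, i in keyed[:k])


# 50 "easy" majority instances (error 0.05) and 50 "hard" ones (error 0.9).
errors = [0.05] * 50 + [0.9] * 50
picked = gaussian_undersample(errors, n_minority=10, mu=0.9, sigma=0.05, seed=0)
```

With mu close to 0.9 and a narrow sigma, the draw concentrates on the hard instances; a mu near 0 would favour the easy ones instead.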

Pros and Cons of MESA

Here are some personal thoughts on the advantages and disadvantages of MESA. More discussions are welcome!

Pros:

  • 🍎 Wide compatibility.
    We decoupled the model-training and meta-training process in MESA, making it compatible with most of the existing machine learning models.
  • 🍎 High data efficiency.
    MESA performs strictly balanced under-sampling to train each base-learner in the ensemble. This makes it more data-efficient than other methods, especially on highly skewed data sets.
  • 🍎 Good performance.
    The sampling strategy is optimized for better final generalization performance, so we expect it to yield a better ensemble model.
  • 🍎 Transferability.
    We use only task-agnostic meta-information during meta-training, which means that a meta-sampler can be directly used in unseen new tasks, thereby greatly reducing the computational cost brought about by meta-training.

Cons:

  • 🍏 Meta-training cost.
    Meta-training repeats the ensemble training process multiple times, which can be costly in practice. Shrinking the dataset used in meta-training reduces the computational cost at the price of a minor performance loss.
  • 🍏 Need to set aside a separate validation set for training.
    The meta-state is formed by computing the error distribution on both the training and validation sets.
  • 🍏 Possible unstable performance on small datasets.
    Small datasets may cause the obtained error distribution statistics to be inaccurate/unstable, which will interfere with the meta-training process.
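To make the meta-state mentioned above concrete, here is a minimal sketch. It assumes the error of an instance is |y - p| for label y and predicted positive-class probability p, and that the error distribution is a normalised histogram with num_bins bins per set, so the state has 2 * num_bins entries (consistent with the --num_bins help text). The function names are hypothetical, not the repository's.

```python
def error_histogram(y_true, y_prob, num_bins=5):
    """Normalised histogram of absolute errors |y - p| over [0, 1]."""
    counts = [0] * num_bins
    for y, p in zip(y_true, y_prob):
        err = abs(y - p)
        # Clamp so that err == 1.0 falls into the last bin.
        idx = min(int(err * num_bins), num_bins - 1)
        counts[idx] += 1
    total = sum(counts) or 1  # avoid dividing by zero on empty input
    return [c / total for c in counts]


def meta_state(y_train, p_train, y_valid, p_valid, num_bins=5):
    """Concatenate train and validation histograms (length 2 * num_bins)."""
    return (error_histogram(y_train, p_train, num_bins)
            + error_histogram(y_valid, p_valid, num_bins))


state = meta_state(
    y_train=[0, 1, 1, 0], p_train=[0.1, 0.9, 0.3, 0.5],
    y_valid=[0, 1], p_valid=[0.0, 0.0],
)
```

With only a handful of samples per set, a single instance moves a whole bin by a large fraction, which illustrates why the statistics can be unstable on small datasets.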

Requirements

Main dependencies:

To install requirements, run:

pip install -r requirements.txt

NOTE: this implementation requires an old version of PyTorch (v1.0.0). You may want to create a new conda environment to run our code. The step-by-step guide is as follows (using torch-cpu as an example):

  • conda create --name mesa python=3.7.11
  • conda activate mesa
  • conda install pytorch-cpu==1.0.0 torchvision-cpu==0.2.1 cpuonly -c pytorch
  • pip install -r requirements.txt

These commands should help you to get ready for running mesa. If you have any further questions, please feel free to open an issue or drop me an email.

Usage

A typical usage example:

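# imports (module names as used in the repository's main script)
from mesa import Mesa
from arguments import parser
from utils import Rater, load_dataset
from sklearn.tree import DecisionTreeClassifier

# load dataset & prepare environment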
args = parser.parse_args()
rater = Rater(args.metric)
X_train, y_train, X_valid, y_valid, X_test, y_test = load_dataset(args.dataset)
base_estimator = DecisionTreeClassifier()

# meta-training
mesa = Mesa(
    args=args, 
    base_estimator=base_estimator, 
    n_estimators=10)
mesa.meta_fit(X_train, y_train, X_valid, y_valid, X_test, y_test)

# ensemble training
mesa.fit(X_train, y_train, X_valid, y_valid)

# evaluate
y_pred_test = mesa.predict_proba(X_test)[:, 1]
score = rater.score(y_test, y_pred_test)

Running main.py

Here is an example:

python main.py --dataset Mammo --meta_verbose 10 --update_steps 1000

You can get help with arguments by running:

python main.py --help
optional arguments:
  # Soft Actor-critic Arguments
  -h, --help            show this help message and exit
  --env-name ENV_NAME
  --policy POLICY       Policy Type: Gaussian | Deterministic (default:
                        Gaussian)
  --eval EVAL           Evaluates a policy every 10 episode (default:
                        True)
  --gamma G             discount factor for reward (default: 0.99)
  --tau G               target smoothing coefficient(τ) (default: 0.01)
  --lr G                learning rate (default: 0.001)
  --lr_decay_steps N    step_size of StepLR learning rate decay scheduler
                        (default: 10)
  --lr_decay_gamma N    gamma of StepLR learning rate decay scheduler
                        (default: 0.99)
  --alpha G             Temperature parameter α determines the relative
                        importance of the entropy term against the reward
                        (default: 0.1)
  --automatic_entropy_tuning G
                        Automaically adjust α (default: False)
  --seed N              random seed (default: None)
  --batch_size N        batch size (default: 64)
  --hidden_size N       hidden size (default: 50)
  --updates_per_step N  model updates per simulator step (default: 1)
  --update_steps N      maximum number of steps (default: 1000)
  --start_steps N       Steps sampling random actions (default: 500)
  --target_update_interval N
                        Value target update per no. of updates per step
                        (default: 1)
  --replay_size N       size of replay buffer (default: 1000)

  # Mesa Arguments
  --cuda                run on CUDA (default: False)
  --dataset N           the dataset used for meta-training (default: Mammo)
  --metric N            the metric used for evaluate (default: aucprc)
  --reward_coefficient N
  --num_bins N          number of bins (default: 5). state-size = 2 *
                        num_bins.
  --sigma N             sigma of the Gaussian function used in meta-sampling
                        (default: 0.2)
  --max_estimators N    maximum number of base estimators in each meta-
                        training episode (default: 10)
  --meta_verbose N      number of episodes between verbose outputs. If 'full'
                        print log for each base estimator (default: 10)
  --meta_verbose_mean_episodes N
                        number of episodes used for compute latest mean score
                        in verbose outputs.
  --verbose N           enable verbose when ensemble fit (default: False)
  --random_state N      random_state (default: None)
  --train_ir N          imbalance ratio of the training set after meta-
                        sampling (default: 1)
  --train_ratio N       the ratio of the data used in meta-training. set
                        train_ratio<1 to use a random subset for meta-training
                        (default: 1)
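The --lr, --lr_decay_steps and --lr_decay_gamma options refer to a StepLR scheduler. As a worked illustration, StepLR multiplies the learning rate by gamma once every step_size scheduler steps. The helper below is a hypothetical stand-alone function using the defaults from the help text; how often MESA advances the scheduler is not stated here.

```python
def step_lr(base_lr=0.001, step_size=10, gamma=0.99, epoch=0):
    """Closed form of a StepLR schedule: decay by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


# Learning rate at a few points of the schedule.
schedule = {epoch: step_lr(epoch=epoch) for epoch in (0, 9, 10, 25)}
```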

We include a highly imbalanced dataset Mammography (#majority class instances = 10,923, #minority class instances = 260, imbalance ratio = 42.012) and its variants with flip label noise for quick testing and visualization of MESA and other baselines. You can use mesa-example.ipynb to quickly:

  • conduct a comparative experiment
  • visualize the meta-training process of MESA
  • visualize the experimental results of MESA and other baselines

Please check mesa-example.ipynb for more details.
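For reference, the imbalance ratio quoted for Mammography is simply the majority count divided by the minority count. A quick check, where the helper name is illustrative:

```python
from collections import Counter


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Class counts taken from the dataset description above.
labels = [0] * 10923 + [1] * 260
ratio = round(imbalance_ratio(labels), 3)
print(ratio)  # 42.012
```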

Visualization and Results

Class distribution of Mammography dataset

image

Visualize the meta-training process

Comparison with baseline methods

image

Other results

Dataset description

image

Comparisons of MESA with under-sampling-based EIL methods

image

Comparisons of MESA with over-sampling-based EIL methods

image

Comparisons of MESA with resampling-based EIL methods

image

Miscellaneous

Check out our previous work Self-paced Ensemble (ICDE 2020).
It is a simple heuristic-based method, but it is very fast and works reasonably well.

This repository contains:

  • Implementation of MESA
  • Implementation of 7 ensemble imbalanced learning baselines
    • SMOTEBoost [1]
    • SMOTEBagging [2]
    • RAMOBoost [3]
    • RUSBoost [4]
    • UnderBagging [5]
    • BalanceCascade [6]
    • SelfPacedEnsemble [7]
  • Implementation of 11 resampling imbalanced learning baselines [8]

NOTE: The implementations of the above baseline methods are based on imbalanced-algorithms and imbalanced-learn.

References

[1] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, SMOTEBoost: Improving prediction of the minority class in boosting. in European conference on principles of data mining and knowledge discovery. Springer, 2003, pp. 107–119.
[2] S. Wang and X. Yao, Diversity analysis on imbalanced data sets by using ensemble models. in 2009 IEEE Symposium on Computational Intelligence and Data Mining. IEEE, 2009, pp. 324–331.
[3] Sheng Chen, Haibo He, and Edwardo A Garcia. 2010. RAMOBoost: ranked minority oversampling in boosting. IEEE Transactions on Neural Networks 21, 10 (2010), 1624–1642.
[4] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, vol. 40, no. 1, pp. 185–197, 2010.
[5] R. Barandela, R. M. Valdovinos, and J. S. Sánchez, New applications of ensembles of classifiers. Pattern Analysis & Applications, vol. 6, no. 3, pp. 245–256, 2003.
[6] X.-Y. Liu, J. Wu, and Z.-H. Zhou, Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 39, no. 2, pp. 539–550, 2009.
[7] Zhining Liu, Wei Cao, Zhifeng Gao, Jiang Bian, Hechang Chen, Yi Chang, and Tie-Yan Liu. 2019. Self-paced Ensemble for Highly Imbalanced Massive Data Classification. 2020 IEEE 36th International Conference on Data Engineering (ICDE). IEEE, 2020, pp. 841-852.
[8] Guillaume Lemaître, Fernando Nogueira, and Christos K. Aridas. Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning. Journal of Machine Learning Research, 18(17):1–5, 2017.

Contributors ✨

Thanks goes to these wonderful people (emoji key):


Zhining Liu

🤔 💻

This project follows the all-contributors specification. Contributions of any kind welcome!


mesa's Issues

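Question about data labeling and the train_ir setting

Hello author, I am glad to have come across this project that you put together and open-sourced. I happen to be working on a multi-class project with imbalanced data. I tried every method in imbalanced-ensemble, but none of them improved my results much, so I would like to try this meta-learning approach. I have no prior experience with meta-learning, and because the project is urgent I may not have time to fill in the background properly, so my questions may be a bit basic.

Here is the situation: I converted my data into multiple binary classification tasks to fit this project, but at run time I ran into a problem with train_ir. Does the code in this screenshot require 0 to be the majority class and 1 to be the minority class? Otherwise it would have to be controlled through train_ir, right? Also, is the imbalance ratio here the desired ratio of majority class to minority class?

I would also like to ask whether you can recommend any good methods for discrete-feature, multi-class data with class overlap and imbalance.

Finally, thank you again for open-sourcing this project. I wish you good health and a happy family.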

xx
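The decomposition described in this question can be sketched as follows. This is only an illustration of a one-vs-rest split that reports which binary label ends up as the minority in each task; it does not answer whether MESA expects a particular label convention, which the thread leaves open. The function name is hypothetical.

```python
def one_vs_rest_tasks(labels):
    """Return {class: (binary_labels, minority_label, imbalance_ratio)}.

    In each task the chosen class is relabelled 1 and all others 0.
    Requires at least two distinct classes.
    """
    classes = sorted(set(labels))
    if len(classes) < 2:
        raise ValueError("need at least two classes")
    tasks = {}
    for cls in classes:
        binary = [1 if y == cls else 0 for y in labels]
        n_pos = sum(binary)
        n_neg = len(binary) - n_pos
        minority = 1 if n_pos <= n_neg else 0
        ratio = max(n_pos, n_neg) / min(n_pos, n_neg)
        tasks[cls] = (binary, minority, ratio)
    return tasks


tasks = one_vs_rest_tasks(["a"] * 6 + ["b"] * 3 + ["c"] * 1)
```

In this toy example, class "a" holds more than half of the data, so in its task label 1 is the majority, not the minority. That is the situation the question is asking about.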

error : mesa.predict_proba

import time
from mesa import Mesa
from arguments import parser
from utils import Rater, load_dataset
from sklearn.tree import DecisionTreeClassifier

if __name__ == '__main__':

    # load dataset & prepare environment
    args = parser.parse_args()
    rater = Rater(args.metric)
    X_train, y_train, X_valid, y_valid, X_test, y_test = load_dataset(args.dataset)
    base_estimator = DecisionTreeClassifier(max_depth=None)

    # meta-training
    print('\nStart meta-training of MESA ... ...\n')
    mesa = Mesa(
        args=args,
        base_estimator=base_estimator,
        n_estimators=args.max_estimators)
    mesa.meta_fit(X_train, y_train, X_valid, y_valid, X_test, y_test)
    mesa.predict_proba(X_test)

run....................

mesa.predict_proba(X_test)

File "D:\pyyj\mesa-master\environment.py", line 84, in predict_proba
if y_pred.shape[1] == 1:

IndexError: tuple index out of range

Request: update the code to work with newer versions of torch

From what I found, versions of torch after 1.4.0 raise this error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [50, 1]], which is output 0 of AsStridedBackward0, is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

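1.4.0 is a fairly old version and does not seem to support Python 3.9, so I am asking for the mesa code to be updated to work with newer versions of torch.
Thank you very much!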

Issue running model

Hi,

Thanks for the great work. I tried installing the dependencies as explained in the latest version of the README file and I got:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [50, 1]], which is output 0 of AsStridedBackward0, is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Alternatively, when installing the same version of pytorch 1.0.0 with GPU support I got this different issue:
https://discuss.pytorch.org/t/undefined-symbol-cblas-sgemm-alloc/32497

Is there any other way to build the dependencies?

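Error in a multi-class task

I tried modifying the utility module utils.py in the source directly: I loaded my own dataset exactly the way load_dataset does, added acc and micro f1_score to the score function of the Rater class to evaluate the multi-class case, and used y_pred.argmax(axis=1) to determine the predicted class. However, I either get a dimension mismatch or TypeError: Singleton array 4 cannot be considered a valid collection. Is there a multi-class usage example I could learn from?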

Not able to run the model

Hi,

I've been trying to run your mesa_example notebook, however, I haven't managed to make it work. When the meta_fit() method is called, I'm getting a RuntimeError. I've tried debugging it, but without any luck. Here is a screenshot of the problem:

Screenshot 2021-08-18 at 12 54 42

Have you encountered this? If yes, could you please tell me how to fix it?

Thanks in advance!
