The Quantile-Based Balanced Sampling Algorithm is a method for balancing imbalanced datasets, particularly when the majority classes have far more samples than the minority class. Its goal is a more balanced dataset that improves machine learning model performance.
The algorithm calculates quantiles for each feature in the dataset, generates a set of permutations of those quantiles, and selects the non-minority class sample closest to each quantile permutation. This process aims to preserve the underlying data distribution while giving the majority and minority classes equal representation.
- Count the unique non-minority class labels (c), minority class samples (m), and features (f).
- Create an empty set (d) and add all minority class samples to it.
- Calculate the number of quantiles (q) such that f^q = c*m.
- Calculate the q quantiles for each feature.
- Generate a set of all permutations of c quantiles (p).
- Sort the non-minority class samples by their distance to each quantile for each feature.
- For each quantile permutation in p, add the closest non-minority class sample to set d.
- Return the balanced dataset d.
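The steps above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the code in qbs.py: the function name `quantile_balanced_sample` and the `minority_label` parameter are hypothetical, distance is taken to be Euclidean, each non-minority sample is used at most once, and a "quantile permutation" is read as one quantile value per feature. Under that reading q quantiles yield q**f combinations, so the sketch picks the smallest q with q**f >= c*m rather than solving the condition exactly as written in the steps.

```python
import itertools

import numpy as np


def quantile_balanced_sample(X, y, minority_label):
    """Hypothetical sketch: keep all minority rows plus about c*m majority rows."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)

    m = len(minority_idx)                  # minority class samples
    c = len(np.unique(y[majority_idx]))    # unique non-minority class labels
    f = X.shape[1]                         # features
    # Cap the target so we never ask for more majority rows than exist.
    target = min(c * m, len(majority_idx))

    # Assumption: smallest q whose q**f combinations cover the target.
    q = 1
    while q ** f < target:
        q += 1

    # q evenly spaced interior quantile levels, computed per feature.
    levels = np.linspace(0.0, 1.0, q + 2)[1:-1]
    quantiles = np.quantile(X, levels, axis=0)  # shape (q, f)

    available = np.ones(len(majority_idx), dtype=bool)
    chosen = []
    # One quantile value per feature forms a point in feature space.
    for point in itertools.product(*quantiles.T):
        if len(chosen) >= target:
            break
        dist = np.linalg.norm(X[majority_idx] - np.asarray(point), axis=1)
        dist[~available] = np.inf          # skip rows already selected
        best = int(np.argmin(dist))
        available[best] = False
        chosen.append(majority_idx[best])

    keep = np.concatenate([minority_idx, np.asarray(chosen, dtype=int)])
    return X[keep], y[keep]
```

Calling `quantile_balanced_sample(X, y, minority_label=1)` returns the reduced feature matrix and labels. Two caveats of this sketch: it stops as soon as the target count is reached, so when q**f exceeds c*m the later combinations in `itertools.product` order are never visited, and it does not enforce an equal count per non-minority class when c is greater than 1.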
This algorithm can be implemented in various programming languages such as Python, R, or MATLAB. You can use it to preprocess your imbalanced dataset before feeding it to your machine learning model. Please note that you may need to adjust the algorithm according to the specific data structure and requirements of your project.
Look at the qbs.py file for a sample implementation.
We welcome contributions to improve this algorithm. Feel free to submit pull requests or raise issues to discuss potential improvements, bug fixes, or feature requests.