phiyodr / multilabel-oversampling

Many algorithms for imbalanced data support binary and multiclass classification only. This approach is made for multi-label classification (aka multi-target classification). 🌻

License: MIT License

multilabel-oversampling's Introduction

Multilabel Oversampling 🌻

Many algorithms for imbalanced data support binary and multiclass classification only. This approach is made for multi-label classification (aka multi-target classification).

🎰 Algorithm

  • Start with a multilabel dataset (as a pandas.DataFrame) with imbalanced labels.
  • Count the samples per class, then calculate the standard deviation (std) of these counts.
  • Repeat the following number_of_adds times:
    • Randomly draw a sample from your data and calculate the new std.
    • If the new std is lower, add the sample to your dataset.
    • If not, draw another sample (do this up to number_of_tries times).
  • A new df is returned.
  • A result plot visualizes the target distribution before and after upsampling. It also shows the counts per index.
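
The steps above can be sketched in a few lines of pandas and NumPy. This is a minimal illustration, not the package's actual implementation: the function name `greedy_oversample`, its `target_cols` and `seed` parameters, the use of the population std, and the early stop when no draw helps (suggested by the "No improvement after 100 tries" log line in the Usage section) are all assumptions.

```python
import numpy as np
import pandas as pd


def greedy_oversample(df, target_cols, number_of_adds=100, number_of_tries=100, seed=0):
    """Append randomly drawn rows as long as each one lowers the std of the label counts.

    Hypothetical helper for illustration; not part of multilabel_oversampling.
    """
    rng = np.random.default_rng(seed)
    labels = df[target_cols].to_numpy()
    counts = labels.sum(axis=0).astype(float)  # number of positives per label
    best_std = counts.std()                    # population std (assumption)
    picked = []                                # positions of the accepted rows

    for _ in range(number_of_adds):
        for _ in range(number_of_tries):
            pos = rng.integers(len(df))
            new_counts = counts + labels[pos]
            new_std = new_counts.std()
            if new_std < best_std:             # accept only a strict improvement
                counts, best_std = new_counts, new_std
                picked.append(pos)
                break
        else:
            break                              # no draw improved the std: stop early

    return pd.concat([df, df.iloc[picked]])


toy = pd.DataFrame({
    "y1": [1, 1, 1, 1, 0],
    "y2": [1, 1, 0, 0, 0],
    "y3": [0, 0, 0, 0, 1],
    "x": [f"img_{i}.jpg" for i in range(5)],
})
balanced = greedy_oversample(toy, ["y1", "y2", "y3"], number_of_adds=10, number_of_tries=50)
print(toy[["y1", "y2", "y3"]].sum().to_dict())
print(balanced[["y1", "y2", "y3"]].sum().to_dict())
```

Because a row is only accepted when it strictly lowers the std, the returned frame always starts with the original rows and never ends up less balanced than the input.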

➡️ Usage

import multilabel_oversampling as mo

mo.seed_everything(20)
df = mo.create_fake_data(size=1) # difficult fake dataset with very high dependency of y1 and y2
ml_oversampler = mo.MultilabelOversampler(number_of_adds=100, number_of_tries=100)
df_new = ml_oversampler.fit(df)
#>Start the upsampling process.
#>Iteration:  11%|████████████████                                        | 11/100 [00:00<00:01, 48.43it/s]
#>Iter 11: No improvement after 100 tries.
#>Sampling done.
#>
#>Dataset size original: 20; Upsampled dataset size: 31
#>Original target distribution:  {'y1': 16, 'y2': 12, 'y3': 4, 'y4': 4}
#>Upsampled target distribution: {'y1': 19, 'y2': 12, 'y3': 15, 'y4': 15}
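
To see what the run above achieved, the two printed distributions can be compared directly. The sketch below assumes the population standard deviation (`np.std` with its default `ddof=0`); the package may compute the std differently. Under that assumption the std falls from about 5.2 to about 2.5. The second part checks every label pattern that occurs in the example DataFrame and lists those that would still strictly lower the std if added once more.

```python
import numpy as np

# Distributions copied from the log output above.
original = {"y1": 16, "y2": 12, "y3": 4, "y4": 4}
upsampled = {"y1": 19, "y2": 12, "y3": 15, "y4": 15}

std_before = np.std(list(original.values()))
std_after = np.std(list(upsampled.values()))
print(f"std before: {std_before:.2f}, std after: {std_after:.2f}")

# Distinct (y1, y2, y3, y4) patterns that occur in the example DataFrame.
patterns = [
    (1, 1, 0, 0), (1, 1, 0, 1), (1, 1, 1, 0), (1, 0, 1, 0),
    (1, 0, 1, 1), (1, 0, 0, 0), (0, 0, 0, 0), (0, 0, 1, 1),
]
counts = np.array(list(upsampled.values()))
improving = [p for p in patterns if np.std(counts + np.array(p)) < std_after]
print("patterns that would still lower the std:", improving)
```

If the acceptance rule requires a strict decrease, an empty list here is consistent with the "No improvement after 100 tries" message at iteration 11.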

ml_oversampler.plot_all_tries()

Plot from ml_oversampler.plot_all_tries()

ml_oversampler.plot_results()

Plot from ml_oversampler.plot_results()

#import seaborn as sns
#df.style.background_gradient(cmap=sns.color_palette("Spectral", as_cmap=True))

# Original DataFrame
print(df)
#>    y1  y2  y3  y4           x
#>0    1   1   0   0   img_0.jpg
#>1    1   1   0   0   img_1.jpg
#>2    1   1   0   1   img_2.jpg
#>3    1   1   0   0   img_3.jpg
#>4    1   1   1   0   img_4.jpg
#>5    1   1   0   0   img_5.jpg
#>6    1   1   0   0   img_6.jpg
#>7    1   1   0   0   img_7.jpg
#>8    1   1   0   1   img_8.jpg
#>9    1   1   0   0   img_9.jpg
#>10   1   1   0   0  img_10.jpg
#>11   1   1   0   0  img_11.jpg
#>12   1   0   1   0  img_12.jpg
#>13   1   0   1   1  img_13.jpg
#>14   1   0   0   0  img_14.jpg
#>15   1   0   0   0  img_15.jpg
#>16   0   0   0   0  img_16.jpg
#>17   0   0   0   0  img_17.jpg
#>18   0   0   0   0  img_18.jpg
#>19   0   0   1   1  img_19.jpg


# New DataFrame after upsampling
print(df_new)
#>    y1  y2  y3  y4           x
#>0    1   1   0   0   img_0.jpg
#>1    1   1   0   0   img_1.jpg
#>2    1   1   0   1   img_2.jpg
#>3    1   1   0   0   img_3.jpg
#>4    1   1   1   0   img_4.jpg
#>5    1   1   0   0   img_5.jpg
#>6    1   1   0   0   img_6.jpg
#>7    1   1   0   0   img_7.jpg
#>8    1   1   0   1   img_8.jpg
#>9    1   1   0   0   img_9.jpg
#>10   1   1   0   0  img_10.jpg
#>11   1   1   0   0  img_11.jpg
#>12   1   0   1   0  img_12.jpg
#>13   1   0   1   1  img_13.jpg
#>14   1   0   0   0  img_14.jpg
#>15   1   0   0   0  img_15.jpg
#>16   0   0   0   0  img_16.jpg
#>17   0   0   0   0  img_17.jpg
#>18   0   0   0   0  img_18.jpg
#>19   0   0   1   1  img_19.jpg
#>19   0   0   1   1  img_19.jpg
#>19   0   0   1   1  img_19.jpg
#>13   1   0   1   1  img_13.jpg
#>13   1   0   1   1  img_13.jpg
#>13   1   0   1   1  img_13.jpg
#>19   0   0   1   1  img_19.jpg
#>19   0   0   1   1  img_19.jpg
#>19   0   0   1   1  img_19.jpg
#>19   0   0   1   1  img_19.jpg
#>19   0   0   1   1  img_19.jpg
#>19   0   0   1   1  img_19.jpg

ℹ️ Install

  • Install from GitHub (you may need to install dependencies from requirements.txt first)
pip install git+https://github.com/phiyodr/multilabel-oversampling

👷 Future work

  • Implement weighted sampling (so that samples that already appear often in the new df are drawn less often)
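
One way such weighted sampling might look is sketched below. It is not implemented in the package; the helper name `draw_weighted` and the weighting scheme `1 / (1 + times_added)` are assumptions chosen only to illustrate the idea.

```python
import numpy as np


def draw_weighted(times_added, rng):
    """Draw one row position; rows that were already added often become less likely.

    Hypothetical helper: weight = 1 / (1 + number of times the row was added).
    """
    weights = 1.0 / (1.0 + np.asarray(times_added, dtype=float))
    probs = weights / weights.sum()
    return int(rng.choice(len(probs), p=probs))


rng = np.random.default_rng(0)
times_added = np.zeros(5, dtype=int)
times_added[4] = 8  # pretend row 4 was already duplicated eight times

draws = [draw_weighted(times_added, rng) for _ in range(1000)]
share_of_row_4 = draws.count(4) / len(draws)
print(f"share of draws hitting row 4: {share_of_row_4:.3f}")
```

With uniform sampling, row 4 would be drawn about 20% of the time; with these weights its share drops well below that, so the remaining draws go to rows that are not yet over-represented.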

🌻

multilabel-oversampling's Issues

Issue with Installing multilabel-oversampling

Hello,

I hope this message finds you well. I recently encountered an issue while trying to install the multilabel-oversampling package directly from the GitHub repository using pip.

Specifically, during the installation process, I received a ModuleNotFoundError indicating that the numpy module was not found. It seems that the package attempts to import numpy during the build/setup phase before numpy is installed, even though numpy is listed as a dependency in install_requires in setup.py.

Here's the error message I received for reference:
  File "/private/var/folders/9y/pzqtcdlx1jv2lcr6z_6sq6jm0000gn/T/pip-req-build-ktfxzxk3/multilabel_oversampling/multilabel_oversampling.py", line 1, in <module>
    import numpy as np
ModuleNotFoundError: No module named 'numpy'
[end of output]

To resolve the issue, I tried the following:

  • Listed numpy at the beginning of my requirements.txt file to ensure it's installed first.
  • Manually installed numpy and other dependencies before trying to install multilabel-oversampling.
  • Attempted to comment out the numpy imports temporarily to bypass the error.

While I eventually found a workaround, it would be helpful for other users if this issue could be addressed directly in the repository. Perhaps there's a way to ensure that the package doesn't attempt to import dependencies during the build/setup phase or to handle this in a way that doesn't result in an error.

Thank you for your attention to this matter, and I appreciate the work you've put into the multilabel-oversampling package!

Best regards,
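
A common cause of this kind of error is a setup.py that imports the package itself, for example to read its version, which in turn runs `import numpy` before pip has installed anything. Whether this repository's setup.py does that is an assumption based only on the traceback above. If it does, one possible fix is to read the metadata from the file as text instead of importing it. The helper `read_version` and the `__version__` value below are hypothetical.

```python
import re
import tempfile
from pathlib import Path


def read_version(init_path):
    """Return the __version__ string from a file without importing the package."""
    text = Path(init_path).read_text(encoding="utf-8")
    match = re.search(r"^__version__\s*=\s*['\"]([^'\"]+)['\"]", text, flags=re.MULTILINE)
    if match is None:
        raise RuntimeError(f"No __version__ found in {init_path}")
    return match.group(1)


# Demonstration on a throwaway file that imports numpy at the top:
# the import line is never executed because the file is only read as text.
with tempfile.TemporaryDirectory() as tmp:
    init_file = Path(tmp) / "__init__.py"
    init_file.write_text("import numpy as np\n__version__ = '0.1.0'\n", encoding="utf-8")
    version = read_version(init_file)

print(version)
```

In a setup.py, the result of such a helper could be passed to `setup(version=...)`, so that numpy is only needed at install time via install_requires and not during the build.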
