Comments (18)

glemaitre avatar glemaitre commented on May 21, 2024 3

Yep, we need to accept strings as input. Right now check_X_y accepts only numeric values.
So those algorithms could override the function to handle such data.

from imbalanced-learn.

glemaitre avatar glemaitre commented on May 21, 2024 1

@nchen9191 The SMOTE implementation is only for continuous data. We have not yet implemented SMOTE-NC, which should deal with categorical features.

glemaitre avatar glemaitre commented on May 21, 2024 1

fmfn avatar fmfn commented on May 21, 2024

It is assumed that you will first vectorize your categorical features with your preferred method.

SMOTE and its variations work by calculating distances between examples from the majority and minority classes. To calculate such distances, your data has to be formatted as one feature vector per entry. That means that categorical features must first be encoded as numerical values (e.g. by using one-hot encoding) before being passed to the object.

At the end of the day, the SMOTE method (and all methods in this package, for that matter) takes as input a design matrix whose entries are all numbers, in addition to the respective labels.
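To make the encoding step concrete, here is a minimal sketch of one-hot encoding in plain Python. The helper `one_hot` and the toy rows are made up for illustration and are not part of imbalanced-learn; in practice you would more likely use an encoder from pandas or scikit-learn.

```python
def one_hot(rows, categorical_idx):
    """Expand each categorical column into one 0/1 column per category.

    Hypothetical helper for illustration only.
    """
    # Collect the sorted set of categories seen in each categorical column.
    categories = {j: sorted({r[j] for r in rows}) for j in categorical_idx}
    encoded = []
    for r in rows:
        out = []
        for j, value in enumerate(r):
            if j in categories:
                # One indicator column per known category.
                out.extend(1.0 if value == c else 0.0 for c in categories[j])
            else:
                out.append(float(value))
        encoded.append(out)
    return encoded


rows = [["red", 1.5], ["blue", 2.0], ["red", 0.5]]
design = one_hot(rows, [0])  # columns: is_blue, is_red, original numeric value
```

Every entry of `design` is now a number, which is the kind of input the samplers expect.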

Does that help?

jacobmontiel avatar jacobmontiel commented on May 21, 2024

@fmfn @glemaitre
Related discussion
Oversampling with categorical variables

Weka gets a data set with categorical "C" and numerical "N" features and returns an over-sampled data set keeping the same data types:

Input
schema: [C | N | N | C | N]
samples = n

Output
schema: [C | N | N | C | N]
samples = n + ratio*minority_class_samples

Reference code
Weka - SMOTE.java

nchen9191 avatar nchen9191 commented on May 21, 2024

I encoded my categorical variables as integers using pandas' factorize method. But it seems that SMOTE still treats these variables as continuous and therefore creates new data in which the entries for these categorical variables look like 0.954 or 0.145. Is that supposed to happen? I read somewhere that it may be fine to just round these numbers back to integers, but that seems a little unsafe to me. Please advise.
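For context, the fractional values follow from how SMOTE builds a synthetic sample: it interpolates linearly between a minority sample and one of its neighbours, feature by feature. A rough sketch of that step in plain Python (the function name and the toy vectors are assumptions for illustration, not the library's code):

```python
import random


def smote_interpolate(sample, neighbor, rng):
    """Return a point on the segment between `sample` and `neighbor`."""
    gap = rng.random()  # random position along the segment, in [0, 1)
    return [a + gap * (b - a) for a, b in zip(sample, neighbor)]


rng = random.Random(0)
sample = [2.5, 0]    # [continuous feature, integer-coded category]
neighbor = [3.5, 1]  # same layout, different category code
synthetic = smote_interpolate(sample, neighbor, rng)
# The second entry lands strictly between the two codes, so it is no
# longer a valid category.
```

Rounding such a value back to an integer implicitly treats the codes as ordered, which is arbitrary for nominal categories, so the caution about rounding seems justified.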

Thanks!

nishkalavallabhi avatar nishkalavallabhi commented on May 21, 2024

Is it still only for continuous data?

glemaitre avatar glemaitre commented on May 21, 2024

dbarrundiag avatar dbarrundiag commented on May 21, 2024

@glemaitre Hi, I was just wondering whether certain algorithms like the RandomUnderSampler, which do not calculate distances between examples from the majority and minority classes, could be adapted more easily to handle categorical variables? Thank you very much!
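To illustrate the idea: random under-sampling only needs the labels to decide which rows to keep, so in principle the feature values can be of any type. A minimal sketch in plain Python, assuming a hypothetical helper `random_undersample_indices` (this is not the RandomUnderSampler implementation):

```python
import random
from collections import defaultdict


def random_undersample_indices(y, rng):
    """Pick row indices so that every class keeps as many rows as the rarest one."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    n_min = min(len(idx) for idx in by_class.values())
    keep = []
    for idx in by_class.values():
        # Sampling without replacement; the features are never inspected.
        keep.extend(rng.sample(idx, n_min))
    return sorted(keep)


X = [["red", 1.0], ["blue", 2.0], ["red", 3.0], ["green", 4.0], ["blue", 5.0]]
y = ["a", "a", "a", "b", "b"]
keep = random_undersample_indices(y, random.Random(0))
X_res = [X[i] for i in keep]
y_res = [y[i] for i in keep]
```

Because the selection works purely on indices, the string column in `X` passes through unchanged.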

parulsahi avatar parulsahi commented on May 21, 2024

But doesn't the SMOTE algorithm use a majority rule to pick the value of a categorical variable from the neighbors being considered?
I am using SMOTE, but it converts a nominal variable with categories 0 and 1 into continuous values between 0 and 1.
Is there a solution, perhaps a modification of the categorical variables before feeding them into the SMOTE function?
Thank you.

glemaitre avatar glemaitre commented on May 21, 2024

Use SMOTENC for a mix of categorical and continuous variables.
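The majority rule mentioned in the question is the SMOTE-NC idea: continuous features are interpolated as in SMOTE, while each categorical feature takes the most frequent value among the nearest neighbours. A simplified sketch of that generation step, assuming the neighbours have already been found (the function and data below are illustrative and do not reproduce the library's internals):

```python
import random
from collections import Counter


def synthesize_mixed(sample, neighbors, categorical_idx, rng):
    """Build one synthetic row from `sample` and its k nearest minority neighbours."""
    partner = rng.choice(neighbors)  # neighbour used for interpolation
    gap = rng.random()
    new_row = []
    for j, value in enumerate(sample):
        if j in categorical_idx:
            # Categorical feature: majority vote over the neighbours.
            votes = Counter(n[j] for n in neighbors)
            new_row.append(votes.most_common(1)[0][0])
        else:
            # Continuous feature: linear interpolation, as in plain SMOTE.
            new_row.append(value + gap * (partner[j] - value))
    return new_row


rng = random.Random(0)
sample = [1.0, "red"]
neighbors = [[2.0, "blue"], [3.0, "blue"], [1.5, "red"]]
synthetic = synthesize_mixed(sample, neighbors, {1}, rng)
```

The categorical entry of `synthetic` is always one of the observed categories, never an in-between number.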

atendra12 avatar atendra12 commented on May 21, 2024

Hi,

Thanks for this wonderful package for handling class imbalance. I am trying to use SMOTENC but I am getting a "memory error" in the "fit_resample" method. I have already converted the dtypes and made them as small as possible, but the issue persists. By contrast, SMOTE works fine on the same data. I have 31 GB of RAM, the data shape is (98000, 48), and the data is around 6.5 MB on disk. I am using Python 3.5 and imblearn version '0.4.2'. Can somebody suggest a workaround for this issue? Thanks.

lisiqi avatar lisiqi commented on May 21, 2024

@glemaitre Hi, is it possible to use SMOTENC with only categorical features, each of which has many categories?

melgazar9 avatar melgazar9 commented on May 21, 2024

I want to request an extra feature for SMOTENC that I think would help me in my application. The regular SMOTE has a parameter called ratio. Would you be able to add it to SMOTENC?

glemaitre avatar glemaitre commented on May 21, 2024

The ratio parameter has been deprecated in favour of sampling_strategy. Use sampling_strategy the same way you used ratio: https://imbalanced-learn.readthedocs.io/en/latest/generated/imblearn.over_sampling.SMOTENC.html
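As a rough guide to what a float value means: for a binary problem, the documentation describes a float sampling_strategy as the desired ratio of minority to majority samples after resampling. The arithmetic can be sketched as follows; `n_synthetic_needed` is a hypothetical helper written for this illustration, not a function of the library.

```python
def n_synthetic_needed(n_majority, n_minority, sampling_strategy):
    """Number of minority samples an over-sampler would have to create.

    `sampling_strategy` is read as n_minority_after / n_majority.
    """
    target_minority = int(sampling_strategy * n_majority)
    # This sketch clamps at zero if the minority class is already large enough.
    return max(target_minority - n_minority, 0)


half = n_synthetic_needed(1000, 100, 0.5)  # aim for 500 minority samples
full = n_synthetic_needed(1000, 100, 1.0)  # aim for a fully balanced set
```

Passing the same float as `sampling_strategy` to SMOTENC should lead to the corresponding class ratio, assuming a binary target.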

melgazar9 avatar melgazar9 commented on May 21, 2024

Oh I see, thanks. I am having a bit of trouble getting SMOTENC to fit on a pandas DataFrame. My test example doesn't work, although the code on the website does, and I can't figure out what I'm doing wrong. Do you see anything wrong with this?

import pandas as pd
from imblearn.over_sampling import SMOTENC

s1 = pd.Series([1,2,3,4,5,6])
s2 = pd.Series([1,2,2,9,3,5])
s3 = pd.Series([9,8,3,5,2,3])
s4 = pd.Series([0,1,1,0,1,0])
s5 = pd.Series([0,1,0,0,0,1])
df = pd.concat([s1,s2,s3,s4,s5], axis=1).rename(columns={0:'col1',1:'col2',2:'col3',3:'col4', 4:'col5'})

sm = SMOTENC(categorical_features=['col4', 'col5'])
X,y = sm.fit_resample(df[['col1','col2','col4']], df['col3'])

ValueError Traceback (most recent call last)
in
----> 1 sm = SMOTENC(categorical_features=['col4', 'col5']).fit_resample(df2[['col1','col2','col4']], df2['col3'])

~/anaconda3/envs/lgbm-gpu/lib/python3.6/site-packages/imblearn/base.py in fit_resample(self, X, y)
83 self.sampling_strategy, y, self._sampling_type)
84
---> 85 output = self._fit_resample(X, y)
86
87 if binarize_y:

~/anaconda3/envs/lgbm-gpu/lib/python3.6/site-packages/imblearn/over_sampling/_smote.py in _fit_resample(self, X, y)
938 def _fit_resample(self, X, y):
939 self.n_features_ = X.shape[1]
--> 940 self._validate_estimator()
941
942 # compute the median of the standard deviation of the minority class

~/anaconda3/envs/lgbm-gpu/lib/python3.6/site-packages/imblearn/over_sampling/_smote.py in _validate_estimator(self)
931 raise ValueError(
932 'Some of the categorical indices are out of range. Indices'
--> 933 ' should be between 0 and {}'.format(self.n_features_))
934 self.categorical_features_ = categorical_features
935 self.continuous_features_ = np.setdiff1d(np.arange(self.n_features_),

ValueError: Some of the categorical indices are out of range. Indices should be between 0 and 3

glemaitre avatar glemaitre commented on May 21, 2024

Please open a new issue instead of commenting on a closed issue.

glemaitre avatar glemaitre commented on May 21, 2024

You should pass the numerical indices of the categorical columns, not the column names, as indicated in the documentation.
