Comments (18)
Yep, we need to accept strings as input. Right now check_X_y
accepts only numeric values,
so those algorithms could override the function to handle such data.
from imbalanced-learn.
@nchen9191 The SMOTE implementation is only for continuous data. We have not yet implemented SMOTE-NC, which should deal with categorical features.
It is assumed that you will first vectorize your categorical features with your preferred method.
SMOTE and variations work by calculating distances between examples from the majority and minority classes. In order to be able to calculate such distances your data has to be formatted as a feature vector per entry. That means that categorical features must first be encoded to numerical values (e.g.: by using one hot encoding) before being passed to the object.
At the end of the day, the SMOTE method (and all methods in this package, for that matter) takes as input a design matrix with all entries being numbers, in addition to the respective labels.
Does that help?
@fmfn @glemaitre
Related discussion
Oversampling with categorical variables
Weka gets a data set with categorical "C" and numerical "N" features and returns an over-sampled data set keeping the same data types:
Input
schema: [C | N | N | C | N]
samples = n
Output
schema: [C | N | N | C | N]
samples = n + ratio*minority_class_samples
Reference code
Weka - SMOTE.java
I encoded my categorical variables to integers using pandas' factorize method. But it seems like SMOTE still treated these variables as continuous and thus created new data where the entries for these categorical variables look like 0.954 or 0.145. Is that supposed to happen? I read somewhere that it may be safe just to round these numbers back to integers, but that seems a little unsafe to me. Please advise.
Thanks!
Is it still only for continuous data?
@glemaitre Hi, I was just wondering if certain algorithms like RandomUnderSampler, which do not calculate distances between examples from the majority and minority classes, could more easily be extended to handle categorical variables? Thank you very much!
But doesn't the SMOTE algo use majority rule to find the value of a categorical variable from the neighbors being considered?
I am using the SMOTE algo but it is converting a nominal variable with categories (0 and 1) into continuous values between 0 and 1.
Is there a solution, maybe a modification to the categorical variables before feeding them into the SMOTE function?
Thank you.
Use SMOTENC for a mix of categorical and continuous variables.
Hi,
Thanks for this wonderful package for handling class imbalance. I am trying to use SMOTENC but getting stuck on a "memory error" during the "fit_resample" method. I have already converted the dtypes and made them as small as possible, yet the issue persists. On the contrary, if I use SMOTE it works fine on the same data. I have 31 GB RAM and the data shape is (98000, 48); it is around 6.5 MB on disk. I am using Python 3.5 and imblearn version '0.4.2'. Can somebody suggest some hack to deal with this issue? Thanks.
@glemaitre Hi, is it possible to use SMOTENC for only categorical features, within which there are many categorical values?
I want to request an extra feature for SMOTENC that I think will help me in application. The normal SMOTE library has a parameter called ratio. Will you be able to add that to SMOTENC?
The ratio parameter has been deprecated in favour of sampling_strategy. Use sampling_strategy the same way ratio worked: https://imbalanced-learn.readthedocs.io/en/latest/generated/imblearn.over_sampling.SMOTENC.html
Oh I see - thanks. I am having a bit of trouble getting SMOTENC to fit on a pandas dataframe. A test example won't seem to work but the code on the website works. I can't seem to figure out what I'm doing wrong. Do you see anything wrong with this?
s1 = pd.Series([1,2,3,4,5,6])
s2 = pd.Series([1,2,2,9,3,5])
s3 = pd.Series([9,8,3,5,2,3])
s4 = pd.Series([0,1,1,0,1,0])
s5 = pd.Series([0,1,0,0,0,1])
df = pd.concat([s1,s2,s3,s4,s5], axis=1).rename(columns={0:'col1',1:'col2',2:'col3',3:'col4', 4:'col5'})
sm = SMOTENC(categorical_features=['col4', 'col5'])
X,y = sm.fit_resample(df[['col1','col2','col4']], df['col3'])
ValueError Traceback (most recent call last)
in
----> 1 sm = SMOTENC(categorical_features=['col4', 'col5']).fit_resample(df2[['col1','col2','col4']], df2['col3'])
~/anaconda3/envs/lgbm-gpu/lib/python3.6/site-packages/imblearn/base.py in fit_resample(self, X, y)
83 self.sampling_strategy, y, self._sampling_type)
84
---> 85 output = self._fit_resample(X, y)
86
87 if binarize_y:
~/anaconda3/envs/lgbm-gpu/lib/python3.6/site-packages/imblearn/over_sampling/_smote.py in _fit_resample(self, X, y)
938 def _fit_resample(self, X, y):
939 self.n_features = X.shape[1]
--> 940 self._validate_estimator()
941
942 # compute the median of the standard deviation of the minority class
~/anaconda3/envs/lgbm-gpu/lib/python3.6/site-packages/imblearn/over_sampling/_smote.py in _validate_estimator(self)
931 raise ValueError(
932 'Some of the categorical indices are out of range. Indices'
--> 933 ' should be between 0 and {}'.format(self.n_features))
934 self.categorical_features = categorical_features
935 self.continuous_features_ = np.setdiff1d(np.arange(self.n_features_),
ValueError: Some of the categorical indices are out of range. Indices should be between 0 and 3
Please open a new issue instead of commenting on a closed issue.
You should pass the numerical indices and not the column names, as indicated in the documentation.