Comments (18)
Yep, we need to accept strings as input. Right now check_X_y
accepts only numeric values,
so those algorithms could override the function to handle such data.
from imbalanced-learn.
@nchen9191 The SMOTE implementation is only for continuous data. We have not yet implemented SMOTE-NC, which should deal with categorical features.
It is assumed that you will first vectorize your categorical features with your preferred method.
SMOTE and variations work by calculating distances between examples from the majority and minority classes. In order to be able to calculate such distances your data has to be formatted as a feature vector per entry. That means that categorical features must first be encoded to numerical values (e.g.: by using one hot encoding) before being passed to the object.
At the end of the day, the SMOTE method (and all methods in this package, for that matter) takes as input a design matrix with all entries being numbers, in addition to the respective labels.
Does that help?
@fmfn @glemaitre
Related discussion
Oversampling with categorical variables
Weka gets a data set with categorical "C" and numerical "N" features and returns an over-sampled data set keeping the same data types:
Input
schema: [C | N | N | C | N]
samples = n
Output
schema: [C | N | N | C | N]
samples = n + ratio*minority_class_samples
Reference code
Weka - SMOTE.java
I encoded my categorical variables to integers using pandas' factorize method. But it seems like SMOTE still treated these variables as continuous and thus created new data where the entries for these categorical variables look like 0.954 or 0.145. Is that supposed to happen? I read somewhere that it may be safe just to round these numbers back to integers, but that seems a little unsafe to me. Please advise.
Thanks!
Is it still only for continuous data?
@glemaitre Hi, I was just wondering if certain algorithms like RandomUnderSampler, which do not calculate distances between examples from the majority and minority classes, could more easily be extended to handle categorical variables? Thank you very much!
But doesn't the SMOTE algo use majority rule to find the value of a categorical variable from the neighbors being considered?
I am using the SMOTE algo but it is converting a nominal variable with categories (0 and 1) into continuous values between 0 and 1.
Is there a solution, maybe a modification to the categorical variables before feeding them into the SMOTE function?
Thank you.
Use SMOTENC for a mix of categorical and continuous variables.
Hi,
Thanks for this wonderful package for handling class imbalance. I am trying to use SMOTENC but getting stuck on a "memory error" during the "fit_resample" method. I have already converted the dtypes and made them as small as possible, yet the issue persists. On the contrary, if I use SMOTE it works fine on the same data. I have 31 GB RAM and the data shape is (98000, 48); it is around 6.5 MB on disk. I am using Python 3.5 and imblearn version '0.4.2'. Can somebody suggest some hack to deal with this issue? Thanks.
@glemaitre Hi, is it possible to use SMOTENC for only categorical features, within which there are many categorical values?
I want to request an extra feature for SMOTENC that I think will help me in application. The normal SMOTE library has a parameter called ratio. Will you be able to add that to SMOTENC?
The ratio parameter has been deprecated in favour of sampling_strategy. Use sampling_strategy the same way ratio worked: https://imbalanced-learn.readthedocs.io/en/latest/generated/imblearn.over_sampling.SMOTENC.html
Oh I see - thanks. I am having a bit of trouble getting SMOTENC to fit on a pandas dataframe. A test example won't seem to work but the code on the website works. I can't seem to figure out what I'm doing wrong. Do you see anything wrong with this?
s1 = pd.Series([1,2,3,4,5,6])
s2 = pd.Series([1,2,2,9,3,5])
s3 = pd.Series([9,8,3,5,2,3])
s4 = pd.Series([0,1,1,0,1,0])
s5 = pd.Series([0,1,0,0,0,1])
df = pd.concat([s1,s2,s3,s4,s5], axis=1).rename(columns={0:'col1',1:'col2',2:'col3',3:'col4', 4:'col5'})
sm = SMOTENC(categorical_features=['col4', 'col5'])
X,y = sm.fit_resample(df[['col1','col2','col4']], df['col3'])
ValueError Traceback (most recent call last)
in
----> 1 sm = SMOTENC(categorical_features=['col4', 'col5']).fit_resample(df2[['col1','col2','col4']], df2['col3'])
~/anaconda3/envs/lgbm-gpu/lib/python3.6/site-packages/imblearn/base.py in fit_resample(self, X, y)
83 self.sampling_strategy, y, self._sampling_type)
84
---> 85 output = self._fit_resample(X, y)
86
87 if binarize_y:
~/anaconda3/envs/lgbm-gpu/lib/python3.6/site-packages/imblearn/over_sampling/_smote.py in _fit_resample(self, X, y)
938 def _fit_resample(self, X, y):
939 self.n_features = X.shape[1]
--> 940 self._validate_estimator()
941
942 # compute the median of the standard deviation of the minority class
~/anaconda3/envs/lgbm-gpu/lib/python3.6/site-packages/imblearn/over_sampling/_smote.py in _validate_estimator(self)
931 raise ValueError(
932 'Some of the categorical indices are out of range. Indices'
--> 933 ' should be between 0 and {}'.format(self.n_features))
934 self.categorical_features = categorical_features
935 self.continuous_features_ = np.setdiff1d(np.arange(self.n_features_),
ValueError: Some of the categorical indices are out of range. Indices should be between 0 and 3
Please open a new issue instead of commenting on a closed issue.
You should pass the numerical indices and not the column names, as indicated in the documentation.