kmedoid-discretizer

Adaptive KMedoid discretizer for numerical feature engineering.

Poetry scikit-learn Python Test License: MIT

Description

kmedoid-discretizer (Adaptive KMedoid discretizer) discretizes numerical features into n_bins using the KMedoids clustering algorithm, and is compatible with scikit-learn (an alternative to sklearn's KBinsDiscretizer). With this implementation, you can:

  • Set a custom number of bins for each numerical feature; KMedoids is run independently for each column (see the sketch after this list).
  • Adapt the number of bins dynamically whenever it is too high (more precisely, when two centroids are assigned to the same data point).
  • Choose between multiple backends (serial, multiprocessing, and ray) to speed up the KMedoids computation.
  • Work mainly with Pandas DataFrames and NumPy arrays.
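
Here is a minimal sketch of per-feature bin counts and backend selection (the list form of n_bins and the backend argument mirror the pipeline example further down; the column data is made up):

import pandas as pd

from kmedoid_discretizer.discretizer import KmedoidDiscretizer

# Two numerical columns with different granularity needs
X = pd.DataFrame({
    "age":  [21, 22, 35, 36, 58, 59],
    "fare": [7.2, 8.0, 26.0, 30.0, 512.3, 9.6],
})

# One bin count per column; KMedoids is run independently for each column.
discretizer = KmedoidDiscretizer(
    n_bins=[3, 2],     # 3 bins for "age", 2 bins for "fare"
    backend="serial",  # or "multiprocessing" / "ray"
)
X_discrete = discretizer.fit_transform(X)
print(X_discrete)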

Install

pip install git+ssh://[email protected]/Vic-ai/kmedoid-discretizer.git

To play with the code and run it locally without pip, clone the repository:

git clone [email protected]:Vic-ai/kmedoid-discretizer.git
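
If the project is managed with Poetry (as the badge above suggests), a typical local setup after cloning might be (these exact commands are an assumption, not documented here):

cd kmedoid-discretizer
poetry install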

Usage

Basic Usage

Here is the data for the basic use case:

# Imports needed for the examples below
import pandas as pd

from kmedoid_discretizer.discretizer import KmedoidDiscretizer

# Fake training set
X = pd.DataFrame.from_dict({"feature": [1, 2, 2, 3]})
# Fake testing set
X_test = pd.DataFrame.from_dict({"feature": [0, 2, 5]})

Ordinal encoding

discretizer = KmedoidDiscretizer(2)
# discretize X into 2 bins => 1 and 2 will go in bin 0 and 3 in bin 1.
X_discrete = discretizer.fit_transform(X)
print(X_discrete)
# discretize X_test into 2 bins => 0 and 2 will go in bin 0 and 5 in bin 1.
X_test_discrete = discretizer.transform(X_test)
print(X_test_discrete)
   feature
0        0
1        0
2        0
3        1
   feature
0        0
1        0
2        1

Onehot encoding

discretizer = KmedoidDiscretizer(2, encode="onehot-dense")
# discretize X into 2 bins => 1 and 2 will go in bin 0 and 3 in bin 1.
X_discrete = discretizer.fit_transform(X)
print(X_discrete)
# discretize X_test into 2 bins => 0 and 2 will go in bin 0 and 5 in bin 1.
X_test_discrete = discretizer.transform(X_test)
print(X_test_discrete)
   index    0    1
0      0  1.0  0.0
1      1  1.0  0.0
2      2  1.0  0.0
3      3  0.0  1.0
   index    0    1
0      0  1.0  0.0
1      1  1.0  0.0
2      2  0.0  1.0
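
The adaptive behaviour described above can also be exercised directly: if you request more bins than the data supports, the discretizer is expected to reduce the bin count on its own. A hedged sketch (no output shown, since it depends on the adaptation):

# Only 3 distinct values, but 10 bins requested; per the description,
# the number of bins should adapt downward when two centroids are
# assigned to the same data point.
X_small = pd.DataFrame({"feature": [1.0, 1.0, 2.0, 2.0, 3.0]})
discretizer = KmedoidDiscretizer(10, verbose=True)
X_small_discrete = discretizer.fit_transform(X_small)
print(X_small_discrete["feature"].nunique())  # expected to be at most 3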

Advanced Usage: Titanic (sklearn Pipeline)

Libraries

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

from kmedoid_discretizer.discretizer import KmedoidDiscretizer
from kmedoid_discretizer.utils.utils_external import PandasSimpleImputer

np.random.seed(0)

Titanic Dataset

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

cat_features = ["pclass", "sex"]
num_features = ["age", "fare", "sibsp", "parch"]  # The ones we will discretize

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

Training Pipeline

# Numerical Transformer Pipeline
numeric_transformer = Pipeline(
    steps=[
        ("imputer", PandasSimpleImputer(strategy="median")),
        ("discretizer", KmedoidDiscretizer(
                            n_bins=[8, 5, 7, 7],
                            encode="onehot-dense",
                            backend="serial",
                            verbose=True,
                            seed=0,
                        )),
    ]
)

# Categorical Transformer Pipeline
categorical_transformer = Pipeline(
    steps=[
        ("encoder", OneHotEncoder()),
    ]
)

# The Combination of Numerical and Categorical
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_features),
        ("cat", categorical_transformer, cat_features),
    ]
)

# Overall Pipeline preprocessor + classifier
clf = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", LogisticRegression()),
    ]
)

clf.fit(X_train, y_train)
print("Train score: %.3f" % clf.score(X_train, y_train))
print("Test score: %.3f" % clf.score(X_test, y_test))
Train score: 0.802
Test score: 0.809
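
Once fitted, the pipeline behaves like any other sklearn estimator, so you can also get predictions directly (standard sklearn API, shown for completeness):

# Predict survival for a few held-out passengers
y_pred = clf.predict(X_test.head())
print(y_pred)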

Contributors

Marvin Martin

Daniel Nowak

License

MIT License Vic.ai 2023
