alteryx / categorical_encoding

Repository for the research and implementation of categorical encoding into a Featuretools-compatible Python library

License: BSD 3-Clause "New" or "Revised" License

Makefile 0.34% Jupyter Notebook 72.88% Python 26.77%

categorical_encoding's People

Contributors

alexjwang, gsheni, jeff-hernandez, kmax12, rwedge

categorical_encoding's Issues

API documentation?

Is there a place where I can look at the API documentation for this package? I looked in the Featuretools API reference at https://featuretools.alteryx.com/en/stable/api_reference.html, but could not find anything related to categorical_encoding.

How to get the mapping between original and encoded values?

Problem: I have a dataframe like this:
index weekday
0 Sunday
1 Sunday
2 Wednesday
3 Monday
4 Monday
5 Thursday
6 Tuesday


After encoding:

index weekday
0 3
1 3
2 6
3 1
4 1
5 4
6 5

Now, how do I get the mapping from original to encoded values, like this:

{'Sunday': 3, 'Wednesday': 6, ...}

Thanks,
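
One library-independent way to recover such a mapping is to pair the column before encoding with the column after encoding. This is only a sketch: it assumes both columns are still available and row-aligned, and the variable names `original` and `encoded` are made up for illustration. It does not use any categorical_encoding API.

```python
import pandas as pd

# Hypothetical data copied from the question: the column before and after encoding.
original = pd.Series(
    ["Sunday", "Sunday", "Wednesday", "Monday", "Monday", "Thursday", "Tuesday"],
    name="weekday",
)
encoded = pd.Series([3, 3, 6, 1, 1, 4, 5], name="weekday")

# Pair each original value with its encoded value; repeated pairs collapse in the dict.
mapping = dict(zip(original, encoded))
print(mapping)
```

This only works for encoders that map each category to a single value; encoders that expand one column into several would need a different approach.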

calculate_feature_matrix

#5 Comment:
I fixed this issue by upgrading pandas to 0.24.0, the release that introduced MultiIndex.from_frame:
pip install pandas==0.24.0

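import featuretools as ft
from featuretools.tests.testing_utils import make_ecommerce_entityset

# Build the demo EntitySet and take four identity features from the "log" entity
es = make_ecommerce_entityset()
f1 = ft.Feature(es["log"]["product_id"])
f2 = ft.Feature(es["log"]["purchased"])
f3 = ft.Feature(es["log"]["value"])
f4 = ft.Feature(es["log"]["countrycode"])

features = [f1, f2, f3, f4]
ids = [0, 1, 2, 3, 4, 5]
# Compute the feature matrix for the selected instance ids only
feature_matrix = ft.calculate_feature_matrix(features, es,
                                             instance_ids=ids)
print(feature_matrix)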

The error:


AttributeError Traceback (most recent call last)
in
1 feature_matrix = ft.calculate_feature_matrix(features, es,
----> 2 instance_ids=ids)
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/featuretools/computational_backends/calculate_feature_matrix.py in calculate_feature_matrix(features, entityset, cutoff_time, instance_ids, entities, relationships, cutoff_time_in_index, training_window, approximate, save_progress, verbose, chunk_size, n_jobs, dask_kwargs, progress_callback)
281
282 # ensure rows are sorted by input order
--> 283 feature_matrix = feature_matrix.reindex(pd.MultiIndex.from_frame(cutoff_time[["instance_id", "time"]],
284 names=feature_matrix.index.names))
285
AttributeError: type object 'MultiIndex' has no attribute 'from_frame'
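
To check whether the installed pandas provides the method named in the traceback, something like the following could be run first. It is a minimal sketch: the helper name `supports_multiindex_from_frame` and the sample frame are made up for illustration and are not part of Featuretools.

```python
import pandas as pd


def supports_multiindex_from_frame() -> bool:
    """Return True if this pandas build exposes MultiIndex.from_frame."""
    return hasattr(pd.MultiIndex, "from_frame")


# Hypothetical stand-in for the cutoff-time frame used inside Featuretools.
frame = pd.DataFrame(
    {
        "instance_id": [0, 1],
        "time": pd.to_datetime(["2011-04-09", "2011-04-10"]),
    }
)

if supports_multiindex_from_frame():
    index = pd.MultiIndex.from_frame(frame)
    print(pd.__version__, list(index.names))
else:
    print(pd.__version__, "is too old: upgrade to pandas 0.24.0 or later")
```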

M-Estimate

Describe the encoding method below. Attach any relevant links that reference the encoding method.
Very similar to Target Encoding. The only difference is that it has a single tunable parameter (m), whereas the target encoder has two (min_samples_leaf and smoothing).
https://contrib.scikit-learn.org/categorical-encoding/mestimate.html
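
For orientation, the estimate can be written as follows, assuming the formulation used by category-encoders for a binary target. The symbols are notation introduced here, not taken from the issue: $n_k$ is the number of training rows in category $k$, $n_k^{+}$ is the number of those rows with a positive target, and $\bar{y}$ is the overall target mean (the prior).

$$\hat{x}_k = \frac{n_k^{+} + m\,\bar{y}}{n_k + m}$$

With $m = 0$ this reduces to the plain per-category mean; larger $m$ pulls every category toward the prior.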

Describe the encoder class method. Any additional functions aside from the essential fit(), transform(), and get_features()? For example, Hashing Encoder has get_hash_method().
Similar to Target Encoding.

Describe the encoder primitive for use with Featuretools.
Should have a mapping to encode any values in the dataframe column into its appropriate weighted average.

Describe the use cases in which this encoder would be useful (what kinds of data, high-cardinality, etc.).
Useful for high-cardinality data, where one-hot encoding and other encoders that produce high-dimensional output do not work. Works in the same situations as Target Encoding, but could be useful if Target Encoding's aforementioned parameters do not suit the situation.

Input type?
[Categorical]

Output type?
Numeric

List third party libraries required:
category-encoders

Describe encoding method's behavior with train, test, and new data.
Use train to learn the averages, test to validate the encoding and ML models, and new data will be encoded based off of the fitted encoder from the train data step.
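
The train / new-data behavior described above can be sketched in plain pandas. This is an illustration under assumptions, not the library's implementation: the function names `fit_m_estimate` and `transform_m_estimate` and the toy data are made up, and unseen or missing categories are assumed to fall back to the prior.

```python
import numpy as np
import pandas as pd


def fit_m_estimate(categories: pd.Series, target: pd.Series, m: float = 1.0):
    """Learn a category -> weighted-average mapping from training data."""
    prior = target.mean()
    stats = target.groupby(categories).agg(["sum", "count"])
    mapping = (stats["sum"] + prior * m) / (stats["count"] + m)
    return mapping, prior


def transform_m_estimate(categories: pd.Series, mapping: pd.Series, prior: float):
    """Encode with the learned mapping; unseen or missing values get the prior."""
    return categories.map(mapping).fillna(prior)


# Hypothetical training data
train = pd.DataFrame({"city": ["a", "a", "b", "b", "b"], "y": [1, 0, 1, 1, 1]})
mapping, prior = fit_m_estimate(train["city"], train["y"], m=1.0)

# New data containing an unseen category and a missing value
new_data = pd.Series(["a", "b", "c", np.nan])
encoded = transform_m_estimate(new_data, mapping, prior)
print(encoded.tolist())
```

The mapping is learned once from the training rows and then reused unchanged, so test and new data never influence the encoded values.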

Test cases.
np.nan

ValueError: Length mismatch with Tests

  • While running the tests locally for branch fix_requirements_jupyter, I am getting the following error:
Traceback (most recent call last):
  File "/.pyenv/versions/3.6.9/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/.pyenv/versions/3.6.9/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/categorical_encoding/venv/lib/python3.6/site-packages/category_encoders/hashing.py", line 189, in __require_data
    data_part = self.hashing_trick(X_in=data_part, hashing_method=self.hash_method, N=self.n_components, cols=self.cols)
  File "/categorical_encoding/venv/lib/python3.6/site-packages/category_encoders/hashing.py", line 377, in hashing_trick
    X_cat.columns = new_cols
  File "/categorical_encoding/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 5287, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 67, in pandas._libs.properties.AxisProperty.__set__
  File "/categorical_encoding/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 661, in _set_axis
    self._data.set_axis(axis, labels)
  File "/categorical_encoding/venv/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 178, in set_axis
    f"Length mismatch: Expected axis has {old_len} elements, new "
ValueError: Length mismatch: Expected axis has 2 elements, new values have 8 elements
Traceback (most recent call last):
  File "/.pyenv/versions/3.6.9/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/.pyenv/versions/3.6.9/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/categorical_encoding/venv/lib/python3.6/site-packages/category_encoders/hashing.py", line 189, in __require_data
    data_part = self.hashing_trick(X_in=data_part, hashing_method=self.hash_method, N=self.n_components, cols=self.cols)
  File "/categorical_encoding/venv/lib/python3.6/site-packages/category_encoders/hashing.py", line 377, in hashing_trick
    X_cat.columns = new_cols
  File "/categorical_encoding/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 5287, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 67, in pandas._libs.properties.AxisProperty.__set__
  File "/categorical_encoding/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 661, in _set_axis
    self._data.set_axis(axis, labels)
  File "/categorical_encoding/venv/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 178, in set_axis
    f"Length mismatch: Expected axis has {old_len} elements, new "
ValueError: Length mismatch: Expected axis has 2 elements, new values have 8 elements
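
For readers unfamiliar with this message: pandas raises it whenever the number of new column labels differs from the number of existing columns. The snippet below reproduces only the message, with a made-up two-column frame. It does not reproduce the hashing encoder's behavior and makes no claim about why the test run hit it.

```python
import pandas as pd

# A frame with 2 columns, assigned 8 labels: the same shape mismatch as in the traceback.
frame = pd.DataFrame({"a": [1], "b": [2]})
new_cols = [f"col_{i}" for i in range(8)]

try:
    frame.columns = new_cols
except ValueError as error:
    message = str(error)
    print(message)
```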

Weights of Evidence

Describe the encoding method below. Attach any relevant links that reference the encoding method.
Weights of Evidence (WoE) measures the predictive power of an independent variable in relation to the dependent variable through the formula: $$\text{WoE} = \ln{\frac{\text{Distribution of non-events}}{\text{Distribution of events}}}.$$

WoE is especially useful in certain cases because similar WoE values imply similar categories, which could help with the accuracy of a machine learning algorithm.

Read more about WoE here.
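
The formula above can be sketched per category in pandas. This is a minimal illustration that assumes a binary target where 1 marks an event; the function name `weight_of_evidence` and the toy data are made up, and no smoothing is applied, so a category with zero events or zero non-events would produce an infinite value.

```python
import numpy as np
import pandas as pd


def weight_of_evidence(categories: pd.Series, target: pd.Series) -> pd.Series:
    """Per-category ln(distribution of non-events / distribution of events)."""
    events = target.groupby(categories).sum()
    totals = target.groupby(categories).count()
    non_events = totals - events
    # Share of all events / non-events that falls into each category
    dist_events = events / events.sum()
    dist_non_events = non_events / non_events.sum()
    return np.log(dist_non_events / dist_events)


# Hypothetical data: "event" is 1 when the outcome of interest occurred
data = pd.DataFrame(
    {"status": ["x", "x", "x", "y", "y", "y"], "event": [1, 0, 0, 1, 1, 0]}
)
woe = weight_of_evidence(data["status"], data["event"])
print(woe)
```

Here category x holds one third of the events and two thirds of the non-events, so its WoE is ln 2; category y is the mirror image and gets -ln 2.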

Describe the encoder class method. Any additional functions aside from the essential fit(), transform(), and get_features()?
None for now. May need additional functions in order to integrate with feature calculation.

Describe the encoder primitive for use with Featuretools.
Passes mapping to encoder primitive, which then encodes the column of categoricals.

Describe the use cases in which this encoder would be useful (what kinds of data, high-cardinality, etc.).
Was originally created for use in credit fraud detection. Particularly good for binary situations ("good" and "bad" statuses).

Input type?
possibly sigma, regularization

Output type?
Numeric

List third party libraries required:
category-encoders

Describe encoding method's behavior with train, test, and new data.
Similar to other Bayesian encoders. Fit on train, transform with learned mappings on test and new data.

Test cases.
np.nan
