alteryx / categorical_encoding

Repository for the research and implementation of categorical encoding into a Featuretools-compatible Python library

License: BSD 3-Clause "New" or "Revised" License

Makefile 0.34% Jupyter Notebook 72.88% Python 26.77%

categorical_encoding's Issues

API documentation?

Is there a place where I can look at the API documentation for this package? I checked the Featuretools API reference at https://featuretools.alteryx.com/en/stable/api_reference.html but could not find anything related to categorical_encoding.

Release new version of categorical_encoding for Featuretools 1.0.0

Release a new version (1.0.0?) that is compatible with Featuretools 1.0.0. It should be released before Featuretools 1.0.0, with the Featuretools requirement set to >=1.0.0 in requirements.txt so that it cannot be installed until Featuretools 1.0.0 is released.

How to get the mapping between original and encoded values?

Problem: I have a dataframe like this:
index weekday
0 Sunday
1 Sunday
2 Wednesday
3 Monday
4 Monday
5 Thursday
6 Tuesday

After encoding:

index weekday
0 3
1 3
2 6
3 1
4 1
5 4
6 5

Now how do I get the mapping from original to encoded values, like this:

{'Sunday':3, 'Wednesday':6 ..... }

Thanks,
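
One way to recover such a mapping, independent of which encoder produced the numbers, is to pair the original column with the encoded column and drop duplicate pairs. This is a minimal sketch in plain pandas, assuming the encoding is one-to-one per category (as in the ordinal-style output above); `build_mapping` is a hypothetical helper name, not part of categorical_encoding, and the approach does not apply to encoders that produce several columns per value (one-hot, hashing).

```python
import pandas as pd

# Original and encoded columns, copied from the issue above.
original = pd.Series(
    ["Sunday", "Sunday", "Wednesday", "Monday", "Monday", "Thursday", "Tuesday"],
    name="weekday",
)
encoded = pd.Series([3, 3, 6, 1, 1, 4, 5], name="weekday")


def build_mapping(original, encoded):
    """Return {original value: encoded value} (hypothetical helper)."""
    pairs = pd.DataFrame(
        {"original": original.to_numpy(), "encoded": encoded.to_numpy()}
    ).drop_duplicates()
    # Guard against encodings that are not a function of the category alone.
    if pairs["original"].duplicated().any():
        raise ValueError("a category maps to more than one encoded value")
    return dict(zip(pairs["original"].tolist(), pairs["encoded"].tolist()))


mapping = build_mapping(original, encoded)
print(mapping)
```

The same helper works on any pair of aligned columns, for example the column before and after calling an encoder's transform step.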

Use authenticated pulls

Docker Hub is adding rate limits that may impact CircleCI users in the future, according to a CircleCI article.

Example PR: alteryx/nlp_primitives#44

calculate_feature_matrix

#5 Comment:
I fixed this issue by updating pandas:
pip install pandas==0.24.0

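import featuretools as ft
# Assumed location of the test helper in the Featuretools version used here.
from featuretools.tests.testing_utils import make_ecommerce_entityset

es = make_ecommerce_entityset()
f1 = ft.Feature(es["log"]["product_id"])
f2 = ft.Feature(es["log"]["purchased"])
f3 = ft.Feature(es["log"]["value"])
f4 = ft.Feature(es["log"]["countrycode"])

features = [f1, f2, f3, f4]
ids = [0, 1, 2, 3, 4, 5]
feature_matrix = ft.calculate_feature_matrix(features, es,
                                             instance_ids=ids)
print(feature_matrix)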

The error:

AttributeError Traceback (most recent call last)
in
1 feature_matrix = ft.calculate_feature_matrix(features, es,
----> 2 instance_ids=ids)
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/featuretools/computational_backends/calculate_feature_matrix.py in calculate_feature_matrix(features, entityset, cutoff_time, instance_ids, entities, relationships, cutoff_time_in_index, training_window, approximate, save_progress, verbose, chunk_size, n_jobs, dask_kwargs, progress_callback)
281
282 # ensure rows are sorted by input order
--> 283 feature_matrix = feature_matrix.reindex(pd.MultiIndex.from_frame(cutoff_time[["instance_id", "time"]],
284 names=feature_matrix.index.names))
285
AttributeError: type object 'MultiIndex' has no attribute 'from_frame'
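
The traceback ends in `pd.MultiIndex.from_frame`, a class method that pandas added in version 0.24.0, which is why upgrading pandas resolves the error. The following is a small sketch for checking whether the installed pandas provides it; `has_multiindex_from_frame` is a hypothetical helper name and the `cutoff_time` values are made up for illustration.

```python
import pandas as pd


def has_multiindex_from_frame():
    """True if the installed pandas exposes MultiIndex.from_frame."""
    return hasattr(pd.MultiIndex, "from_frame")


print("pandas version:", pd.__version__)
print("MultiIndex.from_frame available:", has_multiindex_from_frame())

if has_multiindex_from_frame():
    # Same kind of call as the failing line, on a made-up cutoff_time frame.
    cutoff_time = pd.DataFrame(
        {
            "instance_id": [0, 1],
            "time": pd.to_datetime(["2011-04-09", "2011-04-10"]),
        }
    )
    index = pd.MultiIndex.from_frame(cutoff_time[["instance_id", "time"]])
    print(list(index.names))
```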

M-Estimate

Describe the encoding method below. Attach any relevant links that reference the encoding method.
Very similar to Target Encoding. The only difference is that it has a single tunable parameter (m), whereas the target encoder has two (min_samples_leaf and smoothing).
https://contrib.scikit-learn.org/categorical-encoding/mestimate.html

Describe the encoder class method. Any additional functions aside from the essential fit(), transform(), and get_features()? For example, Hashing Encoder has get_hash_method().
Similar to Target Encoding.

Describe the encoder primitive for use with Featuretools.
Should have a mapping to encode any values in the dataframe column into its appropriate weighted average.

Describe the use cases in which this encoder would be useful (what kinds of data, high-cardinality, etc.).
Useful for high-cardinality data where one-hot encoding and similar encoders that produce high-dimensional output do not work well. It applies in the same situations as Target Encoding, and could be useful when Target Encoding's two parameters do not suit the situation.

Input type?
[Categorical]

Output type?
Numeric

List third party libraries required:
category-encoders

Describe encoding method's behavior with train, test, and new data.
Fit on the train data to learn the averages, use the test data to validate the encoding and the ML models, and encode new data with the encoder fitted on the train data.

Test cases.
np.nan
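
The weighted average described in this issue can be sketched in plain pandas. This is an illustration under stated assumptions, not the category-encoders implementation: each category is encoded as (sum of target in category + prior * m) / (count in category + m), where the prior is the overall target mean, and unseen or missing categories (including np.nan) fall back to the prior. The function names `fit_m_estimate` and `transform_m_estimate` are hypothetical.

```python
import numpy as np
import pandas as pd


def fit_m_estimate(categories, target, m=1.0):
    """Learn the per-category M-estimate from train data (hypothetical helper)."""
    prior = target.mean()
    stats = target.groupby(categories).agg(["sum", "count"])
    mapping = (stats["sum"] + prior * m) / (stats["count"] + m)
    return mapping, prior


def transform_m_estimate(categories, mapping, prior):
    """Encode categories; unseen or missing values get the prior."""
    return categories.map(mapping).fillna(prior)


# Made-up train data: category "a" appears twice, "b" once.
train_cat = pd.Series(["a", "a", "b"])
train_y = pd.Series([1, 0, 1])
mapping, prior = fit_m_estimate(train_cat, train_y, m=1.0)

# New data with a seen category, an unseen category, and np.nan.
new_cat = pd.Series(["a", "c", np.nan])
encoded_new = transform_m_estimate(new_cat, mapping, prior)
print(encoded_new)
```

A larger m pulls every category closer to the prior, which is the single knob the issue refers to.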

Move from CircleCI to GitHub Actions

The CI workflow should be moved from CircleCI to GitHub Actions.

Add additional python versions in CI tests workflow

We should test that categorical_encoding supports Python 3.6, 3.7, 3.8, and 3.9.
Once it does, we should update setup.py.

Update categorical_encoding to use woodwork typing

This add-on library should be updated with the new function calls, parameters, and the accessor approach.
For example, the one-hot encoding primitive here: https://github.com/alteryx/categorical_encoding/blob/597fa91259a94f94e5e38804cfd751d96616b946/categorical_encoding/primitives/one_hot_enc.py

ValueError: Length mismatch with Tests

While running the tests locally for branch fix_requirements_jupyter, I am getting the following error:

Traceback (most recent call last):
  File "/.pyenv/versions/3.6.9/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/.pyenv/versions/3.6.9/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/categorical_encoding/venv/lib/python3.6/site-packages/category_encoders/hashing.py", line 189, in __require_data
    data_part = self.hashing_trick(X_in=data_part, hashing_method=self.hash_method, N=self.n_components, cols=self.cols)
  File "/categorical_encoding/venv/lib/python3.6/site-packages/category_encoders/hashing.py", line 377, in hashing_trick
    X_cat.columns = new_cols
  File "/categorical_encoding/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 5287, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 67, in pandas._libs.properties.AxisProperty.__set__
  File "/categorical_encoding/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 661, in _set_axis
    self._data.set_axis(axis, labels)
  File "/categorical_encoding/venv/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 178, in set_axis
    f"Length mismatch: Expected axis has {old_len} elements, new "
ValueError: Length mismatch: Expected axis has 2 elements, new values have 8 elements
Traceback (most recent call last):
  File "/.pyenv/versions/3.6.9/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/.pyenv/versions/3.6.9/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/categorical_encoding/venv/lib/python3.6/site-packages/category_encoders/hashing.py", line 189, in __require_data
    data_part = self.hashing_trick(X_in=data_part, hashing_method=self.hash_method, N=self.n_components, cols=self.cols)
  File "/categorical_encoding/venv/lib/python3.6/site-packages/category_encoders/hashing.py", line 377, in hashing_trick
    X_cat.columns = new_cols
  File "/categorical_encoding/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 5287, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 67, in pandas._libs.properties.AxisProperty.__set__
  File "/categorical_encoding/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 661, in _set_axis
    self._data.set_axis(axis, labels)
  File "/categorical_encoding/venv/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 178, in set_axis
    f"Length mismatch: Expected axis has {old_len} elements, new "
ValueError: Length mismatch: Expected axis has 2 elements, new values have 8 elements

Do you have a code example showing how to one-hot encode test data

so that its columns match the train data?

Something like this:

http://fastml.com/how-to-use-pd-dot-get-dummies-with-the-test-set/
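
As a rough illustration of the idea in the question, the sketch below uses plain pandas rather than this library's primitives: encode the train data, then reindex the encoded test data to the train columns so that categories unseen in train are dropped and categories missing from test become all-zero columns. The column name `color` and the data are made up.

```python
import pandas as pd

# Made-up data: "purple" appears only in test, "red" and "blue" only in train.
train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["green", "purple"]})

# The train data defines the set of output columns.
train_encoded = pd.get_dummies(train, columns=["color"], dtype=int)

# Encode test, then align it to the train columns.
test_encoded = pd.get_dummies(test, columns=["color"], dtype=int).reindex(
    columns=train_encoded.columns, fill_value=0
)

print(train_encoded)
print(test_encoded)
```

Aligning to the train columns keeps the feature layout fixed, which is what a model fitted on the train data expects.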

Update CI tests to use Python 3.6+

Featuretools is dropping support for Python 3.5 in its next release. We should update to a newer Python version to prepare for that.

Weights of Evidence

Describe the encoding method below. Attach any relevant links that reference the encoding method.
Weights of Evidence (WoE) measures the predictive power of an independent variable in relation to the dependent variable through the formula: $$\text{WoE} = \ln{\frac{\text{Distribution of non-events}}{\text{Distribution of events}}}.$$

WoE is especially useful in certain cases because similar WoE values imply similar categories, which could help with the accuracy of a machine learning algorithm.
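
The formula can be sketched per category in pandas, reading "distribution" as the share of all non-events (or events) that fall in a given category. This is an illustration under that assumption, not the library's implementation; `weight_of_evidence` is a hypothetical name, the data is made up, and a category with zero events or zero non-events yields an infinite value here, which practical implementations usually avoid with some form of smoothing.

```python
import numpy as np
import pandas as pd


def weight_of_evidence(categories, events):
    """Per-category ln(share of non-events / share of events)."""
    events = events.astype(bool)
    n_events = events.groupby(categories).sum()
    n_total = events.groupby(categories).count()
    n_non_events = n_total - n_events
    dist_events = n_events / n_events.sum()
    dist_non_events = n_non_events / n_non_events.sum()
    return np.log(dist_non_events / dist_events)


# Made-up data: "a" has 1 event and 2 non-events, "b" has 2 events and 1 non-event.
categories = pd.Series(["a", "a", "a", "b", "b", "b"])
events = pd.Series([1, 0, 0, 1, 1, 0])

woe = weight_of_evidence(categories, events)
print(woe)
```

In this example "a" gets ln(2) and "b" gets -ln(2): the two categories sit on opposite sides of zero because their event rates are on opposite sides of the overall rate.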
