alteryx / categorical_encoding
Repository for the research and implementation of categorical encoding into a Featuretools-compatible Python library
License: BSD 3-Clause "New" or "Revised" License
Is there a place where I can look at the API documentation for this package? I tried the Featuretools API reference (https://featuretools.alteryx.com/en/stable/api_reference.html), but could not find anything related to categorical_encoding.
Release a new version (1.0.0?) that is compatible with Featuretools 1.0.0. This should be released before Featuretools 1.0.0, with the Featuretools requirement set to >=1.0.0 in requirements.txt so that it cannot be installed before Featuretools is released.
Problem: I have a dataframe with the following column:
index weekday
0 Sunday
1 Sunday
2 Wednesday
3 Monday
4 Monday
5 Thursday
6 Tuesday
After encoding:
index weekday
0 3
1 3
2 6
3 1
4 1
5 4
6 5
How do I now get the mapping from original to encoded values, such as:
{'Sunday': 3, 'Wednesday': 6, ...}
Thanks,
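One way to recover such a mapping, independent of which encoder produced the codes, is to pair the original column with the encoded column row by row. The sketch below assumes both columns are still available and share the same row order; the variable names are illustrative and not part of this library.

```python
import pandas as pd

# Original column from the question.
original = pd.Series(
    ["Sunday", "Sunday", "Wednesday", "Monday", "Monday", "Thursday", "Tuesday"],
    name="weekday",
)
# Encoded column as shown in the question.
encoded = pd.Series([3, 3, 6, 1, 1, 4, 5], name="weekday")

# Pair each original value with its code; repeated rows collapse
# into a single dictionary entry.
mapping = {label: int(code) for label, code in zip(original, encoded)}
print(mapping)
```

This only works when every original value always maps to the same code, which holds for ordinal-style encodings like the one shown.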
Docker Hub is adding rate limits that may impact CircleCI users in the future (see the CircleCI article).
Example PR: alteryx/nlp_primitives#44
#5 Comment:
I fixed this issue by updating pandas:
pip install pandas==0.24.0
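import featuretools as ft
# Assumed import path for the test helper that builds the mock e-commerce EntitySet.
from featuretools.tests.testing_utils import make_ecommerce_entityset

es = make_ecommerce_entityset()
# Identity features on four columns of the "log" entity.
f1 = ft.Feature(es["log"]["product_id"])
f2 = ft.Feature(es["log"]["purchased"])
f3 = ft.Feature(es["log"]["value"])
f4 = ft.Feature(es["log"]["countrycode"])
features = [f1, f2, f3, f4]
ids = [0, 1, 2, 3, 4, 5]
feature_matrix = ft.calculate_feature_matrix(features, es, instance_ids=ids)
print(feature_matrix)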
The error:
AttributeError Traceback (most recent call last)
in
1 feature_matrix = ft.calculate_feature_matrix(features, es,
----> 2 instance_ids=ids)
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/featuretools/computational_backends/calculate_feature_matrix.py in calculate_feature_matrix(features, entityset, cutoff_time, instance_ids, entities, relationships, cutoff_time_in_index, training_window, approximate, save_progress, verbose, chunk_size, n_jobs, dask_kwargs, progress_callback)
281
282 # ensure rows are sorted by input order
--> 283 feature_matrix = feature_matrix.reindex(pd.MultiIndex.from_frame(cutoff_time[["instance_id", "time"]],
284 names=feature_matrix.index.names))
285
AttributeError: type object 'MultiIndex' has no attribute 'from_frame'
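The failing line calls pd.MultiIndex.from_frame, which was added in pandas 0.24.0, so older pandas installations raise this AttributeError. A minimal check, using a made-up cutoff-time frame shaped like the one in the traceback (the data values are assumptions for illustration):

```python
import pandas as pd

# MultiIndex.from_frame only exists in pandas 0.24.0 and later.
if not hasattr(pd.MultiIndex, "from_frame"):
    raise RuntimeError("pandas >= 0.24.0 is required, found " + pd.__version__)

# Hypothetical cutoff-time frame with the two columns used in the traceback.
cutoff_time = pd.DataFrame({
    "instance_id": [0, 1, 2],
    "time": pd.to_datetime(["2011-04-09", "2011-04-09", "2011-04-10"]),
})

# Build a two-level index from the frame's columns.
index = pd.MultiIndex.from_frame(cutoff_time[["instance_id", "time"]])
print(index.names)
```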
Describe the encoding method below. Attach any relevant links that reference the encoding method.
Very similar to Target Encoding. The only difference is that it has one tunable parameter (m) versus the target encoder's two tunable parameters (min_samples_leaf and smoothing).
https://contrib.scikit-learn.org/categorical-encoding/mestimate.html
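For reference, the M-estimate blends each category's target sum with the global target mean (the prior), weighted by m. The function below is a sketch of that arithmetic under the usual textbook definition; the function and argument names are hypothetical and not taken from any library.

```python
def m_estimate(target_sum, count, prior, m):
    """Blend a category's target mean with the global prior.

    target_sum: sum of the target over rows in the category
    count:      number of rows in the category
    prior:      global mean of the target
    m:          smoothing strength; m = 0 gives the plain category mean
    """
    return (target_sum + prior * m) / (count + m)


# A category seen twice, both times with target 1, global prior 0.5, m = 1.
encoded = m_estimate(target_sum=2, count=2, prior=0.5, m=1.0)
print(encoded)
```

Larger values of m pull rare categories more strongly toward the prior, which is the single knob this encoder exposes.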
Describe the encoder class method. Any additional functions aside from the essential fit(), transform(), and get_features()? For example, Hashing Encoder has get_hash_method().
Similar to Target Encoding.
Describe the encoder primitive for use with Featuretools.
Should have a mapping to encode any values in the dataframe column into its appropriate weighted average.
Describe the use cases in which this encoder would be useful (what kinds of data, high-cardinality, etc.).
Useful for high-cardinality data where one-hot encoding and other encoders that produce high-dimensional output do not work. Works in the same situations as Target Encoding, and could be useful when Target Encoding's two parameters do not suit the situation.
Input type?
[Categorical]
Output type?
Numeric
List third party libraries required:
category-encoders
Describe encoding method's behavior with train, test, and new data.
Use train to learn the averages and test to validate the encoding and ML models. New data is encoded with the encoder that was fitted on the train data.
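The train/test/new-data behavior can be sketched with plain pandas. This is an illustration, not the proposed encoder class: the helper names are hypothetical, and falling back to the prior for unseen categories and np.nan is an assumption about the desired behavior.

```python
import numpy as np
import pandas as pd


def fit_m_estimate(categories, target, m=1.0):
    """Learn one encoded value per category from the training data."""
    prior = target.mean()
    stats = target.groupby(categories).agg(["sum", "count"])
    mapping = (stats["sum"] + prior * m) / (stats["count"] + m)
    return mapping, prior


def transform_m_estimate(categories, mapping, prior):
    """Encode with the fitted mapping; unseen values and NaN fall back to the prior (assumption)."""
    return categories.map(mapping).fillna(prior)


train = pd.DataFrame({"city": ["a", "a", "b", "b", "b"], "y": [1, 1, 0, 1, 0]})
mapping, prior = fit_m_estimate(train["city"], train["y"], m=1.0)

# New data contains a known category, an unseen one ("c"), and a missing value.
new_data = pd.Series(["a", "b", "c", np.nan], name="city")
encoded = transform_m_estimate(new_data, mapping, prior)
print(encoded)
```

Nothing is re-learned at transform time, so test and new rows never influence the averages.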
Test cases.
np.nan
fix_requirements_jupyter, I am getting the following error:
Traceback (most recent call last):
File "/.pyenv/versions/3.6.9/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/.pyenv/versions/3.6.9/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/categorical_encoding/venv/lib/python3.6/site-packages/category_encoders/hashing.py", line 189, in __require_data
data_part = self.hashing_trick(X_in=data_part, hashing_method=self.hash_method, N=self.n_components, cols=self.cols)
File "/categorical_encoding/venv/lib/python3.6/site-packages/category_encoders/hashing.py", line 377, in hashing_trick
X_cat.columns = new_cols
File "/categorical_encoding/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 5287, in __setattr__
return object.__setattr__(self, name, value)
File "pandas/_libs/properties.pyx", line 67, in pandas._libs.properties.AxisProperty.__set__
File "/categorical_encoding/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 661, in _set_axis
self._data.set_axis(axis, labels)
File "/categorical_encoding/venv/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 178, in set_axis
f"Length mismatch: Expected axis has {old_len} elements, new "
ValueError: Length mismatch: Expected axis has 2 elements, new values have 8 elements
Do you have a code example showing how to one-hot encode test data so that its columns stay in sync with the train data? Something like this:
http://fastml.com/how-to-use-pd-dot-get-dummies-with-the-test-set/
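A common pandas-only approach, sketched here without using this library, is to encode train and test separately and then reindex the test columns to match the train columns. The column name and values are made up for illustration.

```python
import pandas as pd

train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["green", "purple"]})  # "purple" never appears in train

train_encoded = pd.get_dummies(train, columns=["color"], dtype=int)
test_encoded = pd.get_dummies(test, columns=["color"], dtype=int)

# Align test to the train columns: categories missing from test become
# all-zero columns, and categories unseen in train are dropped.
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(test_encoded)
```

A row whose category was not seen in train ends up with zeros in every column, so the model receives the same feature layout it was trained on.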
Featuretools is dropping support for Python 3.5 in its next release. We should update to a newer Python version to prepare for that.
Describe the encoding method below. Attach any relevant links that reference the encoding method.
Weights of Evidence (WoE) tells the predictive power of an independent variable in relation to the dependent variable through the formula:
WoE is especially useful in certain cases because similar WoE values imply similar categories, which could help with the accuracy of a machine learning algorithm.
Read more about WoE here.
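As a rough illustration, WoE is commonly computed per category as the natural log of the ratio between the category's share of events and its share of non-events. The sketch below assumes that sign convention (some references invert the ratio) and adds a regularization term so the ratio stays finite for categories with only one class. The function name and example data are hypothetical.

```python
import numpy as np
import pandas as pd


def weight_of_evidence(categories, target, regularization=1.0):
    """Per-category WoE = ln(share of events / share of non-events)."""
    events = target.groupby(categories).sum()
    totals = target.groupby(categories).count()
    non_events = totals - events
    # Regularization keeps both shares strictly positive.
    event_share = (events + regularization) / (target.sum() + 2 * regularization)
    non_event_share = (non_events + regularization) / (
        (1 - target).sum() + 2 * regularization
    )
    return np.log(event_share / non_event_share)


data = pd.DataFrame({
    "status": ["a", "a", "a", "b", "b", "b"],
    "bad": [1, 1, 0, 0, 0, 1],
})
woe = weight_of_evidence(data["status"], data["bad"])
print(woe)
```

Categories with a positive WoE are over-represented among events and those with a negative WoE among non-events, under the convention assumed here.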
Describe the encoder class method. Any additional functions aside from the essential fit(), transform(), and get_features()?
None for now. May need additional functions in order to integrate with feature calculation.
Describe the encoder primitive for use with Featuretools.
Passes mapping to encoder primitive, which then encodes the column of categoricals.
Describe the use cases in which this encoder would be useful (what kinds of data, high-cardinality, etc.).
Was originally created for use in credit fraud detection. Particularly good for binary situations ("good" and "bad" statuses).
Input type?
possibly sigma, regularization
Output type?
Numeric
List third party libraries required:
category-encoders
Describe encoding method's behavior with train, test, and new data.
Similar to other Bayesian encoders. Fit on train, transform with learned mappings on test and new data.
Test cases.
np.nan