openfe's Introduction

OpenFE: An efficient automated feature generation tool

OpenFE is a new framework for automated feature generation for tabular data. It is easy to use, effective, and efficient, with the following advantages:

  • OpenFE can discover effective candidate features for improving the learning performance of both GBDT and neural networks.
  • OpenFE is efficient and supports parallel computing.
  • OpenFE covers 23 useful and effective operators for generating candidate features.
  • OpenFE supports binary-classification, multi-classification, and regression tasks.
  • OpenFE can automatically handle missing values and categorical features.

For further details, please refer to the paper.

Extensive comparison experiments on public datasets show that OpenFE outperforms existing feature generation methods in both effectiveness and efficiency. Moreover, we validate OpenFE on the IEEE-CIS Fraud Detection Kaggle competition and show that a simple XGBoost model with features generated by OpenFE beats 99.3% of 6351 data science teams. The features generated by OpenFE result in a larger performance improvement than the features provided by the first-place team in the competition.

🔥 News

  • [2023-06-25]: The code and datasets to reproduce the results in our paper are now available at OpenFE_reproduce. Please note that the code for OpenFE in OpenFE_reproduce is not the most recent version, as it is intended solely for reproduction purposes. Typically, employing the latest version here will yield superior performance.
  • [2023-04-26]: OpenFE has been accepted by ICML2023!

🏴󠁶󠁵󠁭󠁡󠁰󠁿 Get Started and Documentation

Installation

It is recommended to use pip for installation.

pip install openfe

Please do not use conda install openfe. It installs a different Python package that is unrelated to ours.

⚡️ A Quick Example

It takes only four lines of code to generate features with OpenFE: first generate the new features, then augment the train and test data with them.

from openfe import OpenFE, transform

ofe = OpenFE()
features = ofe.fit(data=train_x, label=train_y, n_jobs=n_jobs)  # generate new features
train_x, test_x = transform(train_x, test_x, features, n_jobs=n_jobs) # transform the train and test data according to generated features.

We provide an example using the standard california_housing dataset in this link. A more complicated example demonstrating OpenFE can outperform machine learning experts in the IEEE-CIS Fraud Detection Kaggle competition is provided in this link. Users can also refer to our documentation for more advanced usage of OpenFE and FAQ about feature generation.

openfe's Issues

import error, after pip install openfe

I installed the package with pip, but importing openfe raises an error.
The error is:
OSError: dlopen(/Users/yuanjunping/opt/anaconda3/lib/python3.9/site-packages/lightgbm/lib_lightgbm.so, 0x0006): Library not loaded: '/usr/local/opt/libomp/lib/libomp.dylib'
Referenced from: '/Users/yuanjunping/opt/anaconda3/lib/python3.9/site-packages/lightgbm/lib_lightgbm.so'
Reason: tried: '/usr/local/opt/libomp/lib/libomp.dylib' (no such file), '/usr/local/lib/libomp.dylib' (no such file), '/usr/lib/libomp.dylib' (no such file).
My computer is a MacBook Pro 14.
The Python version is 3.9.13.
Looking forward to your reply.
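
Note: the traceback shows LightGBM's shared library failing to load the OpenMP runtime (libomp.dylib). On macOS that runtime is commonly installed with Homebrew (brew install libomp); this remedy is an assumption based on LightGBM's general macOS requirements and is not confirmed in this thread. The sketch below only checks whether the file exists at the paths the loader reported trying.

```python
from pathlib import Path

# Paths copied from the error message above.
CANDIDATE_PATHS = [
    "/usr/local/opt/libomp/lib/libomp.dylib",
    "/usr/local/lib/libomp.dylib",
    "/usr/lib/libomp.dylib",
]


def find_libomp(paths=CANDIDATE_PATHS):
    """Return the candidate paths at which libomp.dylib actually exists."""
    return [p for p in paths if Path(p).is_file()]


found = find_libomp()
if found:
    print("libomp found at:", found)
else:
    print("libomp not found at any path the loader tried")
```

If the list is empty, the import error is expected regardless of how openfe itself was installed.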

Very high latency in even modest datasets

Hello OpenFE authors,

We would appreciate some help getting the code to finish within a reasonable amount of time. E.g. on a dataset with ~7K samples, OpenFE generates 1.3 million candidate transforms, and takes many hours to run, even on a high-memory 96-core machine (all cores are used). We already tried all suggestions outlined on https://openfe-document.readthedocs.io/en/latest/parameter_tuning.html. On larger datasets, OpenFE oftentimes never finishes or crashes.

In particular, we cannot reproduce the latency results reported in the paper.

We are following examples used in this repository. Was there anything done differently for the paper that we can use to run the code in this repo?

Thanks,
Yihe

A process in the process pool was terminated abruptly while the future was running or pending

Hi, thanks for sharing. When I tried OpenFE, I got one error:

my code:
openfe_feature = ofe.fit(data = train_df, label = label_df['label'],n_jobs = 1)

and the error is :
BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

The traceback suggests it happened at line 708 of openfe.py, res.extend(r.result()). It seems that result() could not retrieve the result and failed.

I tried reducing the size of my data to 10 samples and 10 potential features, but I still get the same error.

I am using Python 3.9.7 on Linux.

a question of transform function

Hello, I appreciate your wonderful work! I have a question about transform(). The API documentation says that train and test data need to be transformed together because there are global operators such as 'GroupByThenMean'. I have one training set and multiple test sets. If I call transform() several times, each time with the training set and a different test set, are the transformed training sets the same? If not, how can I make them identical? Thanks a lot, waiting for your reply!
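
Note: the following is a minimal, standard-library illustration of why a global operator can make the transformed training set depend on the test set. It assumes the statistic is computed over the concatenation of train and test rows, as the question describes. The function group_by_then_mean and the toy data are hypothetical and are not OpenFE's implementation.

```python
from collections import defaultdict


def group_by_then_mean(keys, values):
    """For each row, return the mean of `values` over all rows sharing its key."""
    sums, counts = defaultdict(float), defaultdict(int)
    for k, v in zip(keys, values):
        sums[k] += v
        counts[k] += 1
    return [sums[k] / counts[k] for k in keys]


train_keys, train_vals = ["a", "a", "b"], [1.0, 3.0, 10.0]
test_a_keys, test_a_vals = ["a"], [5.0]
test_b_keys, test_b_vals = ["a"], [11.0]

n = len(train_keys)
# The statistic is computed over train + test, then the train rows are sliced off.
with_a = group_by_then_mean(train_keys + test_a_keys, train_vals + test_a_vals)[:n]
with_b = group_by_then_mean(train_keys + test_b_keys, train_vals + test_b_vals)[:n]

print(with_a)  # [3.0, 3.0, 10.0]
print(with_b)  # [5.0, 5.0, 10.0]
```

Under that assumption, the training features differ between the two calls. One way to get identical training features would be to concatenate all test sets, call transform() once, and split the transformed test rows afterwards. Whether this matches the authors' recommendation is not stated here.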

global variable failed in multiprocess task

I ran the California housing sample code and got this error. I am on Windows.

It appears that the global _data statement doesn't work:

data_temp = _data.loc[train_idx + val_idx]
NameError: name '_data' is not defined

Missing Value Filling Problem

Say I have two columns A and B, and OpenFE calculates a column C = A * B. I found that when A or B is null in a record, the value of C for that record is filled with 0. Is 0 the default fillna value? Can I change this default, or leave C null when A or B is null?
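
Note: whether OpenFE exposes a setting for this is not answered in this thread. As a user-side workaround, the generated column can be masked again after transformation. The sketch below is a standard-library illustration with hypothetical helper names. With pandas, the same effect should be achievable with Series.where on a mask built from notna().

```python
import math


def is_missing(x):
    """True for None or NaN."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def restore_missing(c, a, b):
    """Set c[i] to NaN wherever a[i] or b[i] is missing; keep it otherwise."""
    return [float("nan") if is_missing(x) or is_missing(y) else z
            for z, x, y in zip(c, a, b)]


a = [1.0, float("nan"), 3.0]
b = [2.0, 5.0, None]
c = [2.0, 0.0, 0.0]  # generated column as described: missing inputs became 0

fixed = restore_missing(c, a, b)
print(fixed)  # [2.0, nan, nan]
```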

local variable 'score' referenced before assignment

Hello, I get the following error when using OpenFE:
\openfe\openfe.py", line 662, in _calculate_and_evaluate_multiprocess
\openfe\openfe.py", line 595, in _evaluate return score
UnboundLocalError: local variable 'score' referenced before assignment
How can I solve it?
Thank you.
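
Note: in Python, UnboundLocalError on a return statement means the variable was assigned only on some code paths. The sketch below reproduces that general pattern with hypothetical metric names. Whether line 595 of openfe.py fails for this exact reason is an assumption, but it suggests checking that the task and metric passed to fit() are ones the installed version supports.

```python
def evaluate(metric, y_true, y_pred):
    """Buggy pattern: `score` is bound only for the metrics listed here."""
    if metric == "mae":
        score = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    elif metric == "mse":
        score = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return score  # UnboundLocalError for any other metric


def evaluate_safe(metric, y_true, y_pred):
    """Same logic, but an unsupported metric fails with a clear message."""
    if metric == "mae":
        return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    if metric == "mse":
        return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    raise ValueError(f"unsupported metric: {metric!r}")


try:
    evaluate("rmse", [1.0, 2.0], [1.5, 2.5])
except UnboundLocalError as exc:
    print("reproduced:", type(exc).__name__)
```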

process pool

A process in the process pool was terminated abruptly while the future was running or pending. This happens on Mac.

IndexError: string index out of range

IndexError Traceback (most recent call last)
/tmp/ipykernel_27/17403233.py in
2
3 ofe = openfe()
----> 4 features = ofe.fit(data=train_x, label=train_y, n_jobs=10) # generate new features
5 train_x, test_x = transform(train_x, test_x, features, n_jobs=10) # transform the train and test data according to generated features.

/opt/conda/lib/python3.7/site-packages/openfe/openfe.py in fit(self, data, label, task, train_index, val_index, candidate_features_list, init_scores, categorical_features, metric, drop_columns, n_data_blocks, min_candidate_features, feature_boosting, stage1_metric, stage2_metric, stage2_params, is_stage1, n_repeats, tmp_save_path, n_jobs, seed, verbose)
300 self.myprint(f"The number of remaining candidate features is {len(self.candidate_features_list)}")
301 self.myprint("Start stage II selection.")
--> 302 self.new_features_scores_list = self.stage2_select()
303 self.new_features_list = [feature for feature, _ in self.new_features_scores_list]
304 for node, score in self.new_features_scores_list:

/opt/conda/lib/python3.7/site-packages/openfe/openfe.py in stage2_select(self)
529 if self.stage2_metric == 'gain_importance':
530 for i, imp in enumerate(gbm.feature_importances_[:len(new_features)]):
--> 531 results.append([formula_to_tree(new_features[i]), imp])
532 elif self.stage2_metric == 'permutation':
533 r = permutation_importance(gbm, val_x, val_y,

/opt/conda/lib/python3.7/site-packages/openfe/utils.py in formula_to_tree(string)
52 p1 = find_prev(string[:p2-1])
53 if string[0] == '(':
---> 54 return Node(string[p2-1], [formula_to_tree(string[p1:p2 - 1]), formula_to_tree(string[p2:-1])])
55 else:
56 return Node(string[:p1-1], [formula_to_tree(string[p1:p2 - 1]), formula_to_tree(string[p2:-1])])

/opt/conda/lib/python3.7/site-packages/openfe/utils.py in formula_to_tree(string)
28
29 def formula_to_tree(string):
---> 30 if string[-1] != ')':
31 return FNode(string)
32

IndexError: string index out of range

lightgbm loggings

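I ran the California housing test code in a Jupyter notebook and got a lot of LightGBM logging. I don't want to see this output, and the usual approach (import logging) did not suppress it, so I am asking for help. A sample of the logging: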
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000082 seconds.
You can set force_col_wise=true to remove the overhead.[LightGBM] [Info] Total Bins 255

[LightGBM] [Info] Total Bins 255
[LightGBM] [Info] Number of data points in the train set: 7231, number of used features: 1[LightGBM] [Info] Number of data points in the train set: 7231, number of used features: 1

[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000084 seconds.
You can set force_col_wise=true to remove the overhead.
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000086 seconds.
You can set force_col_wise=true to remove the overhead.[LightGBM] [Info] Total Bins 255

[LightGBM] [Info] Total Bins 255[LightGBM] [Info] Number of data points in the train set: 7231, number of used features: 1[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000081 seconds.
You can set force_col_wise=true to remove the overhead.[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000082 seconds.
You can set force_col_wise=true to remove the overhead.

SystemExit: None

When I tried OpenFE, I got the error below.
n_jobs = 4
X_train,X_test,y_train,y_test= train_test_split(X,y, test_size=0.2, random_state=1)
ofe = openfe()
ofe.fit(data=X_train,label=y_train, n_jobs=n_jobs)
_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
File "E:\ml\lib\site-packages\openfe\openfe.py", line 676, in _calculate_and_evaluate_multiprocess
init_metric = self.get_init_metric(val_init, val_y)
File "E:\ml\lib\site-packages\openfe\openfe.py", line 545, in get_init_metric
init_metric = log_loss(label, scipy.special.softmax(pred, axis=1))
File "E:\ml\lib\site-packages\sklearn\metrics_classification.py", line 2424, in log_loss
raise ValueError(
ValueError: y_true and y_pred contain different number of classes 9, 10. Please provide the true labels explicitly through the labels argument. Classes found in y_true: [ 1 3 4 5 6 7 8 9 10]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "E:\ml\lib\concurrent\futures\process.py", line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "E:\ml\lib\site-packages\openfe\openfe.py", line 684, in _calculate_and_evaluate_multiprocess
exit()
File "E:\ml\lib_sitebuiltins.py", line 26, in call
raise SystemExit(code)
SystemExit: None
"""

The above exception was the direct cause of the following exception:

SystemExit Traceback (most recent call last)
in

E:\ml\lib\site-packages\openfe\openfe.py in fit(self, data, label, task, train_index, val_index, candidate_features_list, init_scores, categorical_features, metric, drop_columns, n_data_blocks, min_candidate_features, feature_boosting, stage1_metric, stage2_metric, stage2_params, is_stage1, n_repeats, tmp_save_path, n_jobs, seed, verbose)
297 self.myprint(f"The number of candidate features is {len(self.candidate_features_list)}")
298 self.myprint("Start stage I selection.")
--> 299 self.candidate_features_list = self.stage1_select()
300 self.myprint(f"The number of remaining candidate features is {len(self.candidate_features_list)}")
301 self.myprint("Start stage II selection.")

E:\ml\lib\site-packages\openfe\openfe.py in stage1_select(self, ratio)
459 val_idx = val_index_samples[idx]
460 idx += 1
--> 461 results = self._calculate_and_evaluate(self.candidate_features_list, train_idx, val_idx)
462 candidate_features_scores = sorted(results, key=lambda x: x[1], reverse=True)
463 candidate_features_scores = self.delete_same(candidate_features_scores)

E:\ml\lib\site-packages\openfe\openfe.py in _calculate_and_evaluate(self, candidate_features, train_idx, val_idx)
706 res = []
707 for r in results:
--> 708 res.extend(r.result())
709 return res
710

E:\ml\lib\concurrent\futures_base.py in result(self, timeout)
444 raise CancelledError()
445 elif self._state == FINISHED:
--> 446 return self.__get_result()
447 else:
448 raise TimeoutError()

E:\ml\lib\concurrent\futures_base.py in __get_result(self)
389 if self._exception:
390 try:
--> 391 raise self._exception
392 finally:
393 # Break a reference cycle with the exception in self._exception

SystemExit: None
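
Note: the ValueError in the traceback reports 9 classes in y_true against 10 in the predictions, and the listed classes skip 2. That pattern is consistent with the validation part of the data containing no sample of class 2. The sketch below is a small check a user could run on their own split. The helper name and toy labels are hypothetical.

```python
def missing_classes(train_labels, val_labels):
    """Return classes seen in training but absent from validation, sorted."""
    return sorted(set(train_labels) - set(val_labels))


train_y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2, 3]
val_y = [1, 3, 4, 5, 6, 7, 8, 9, 10]  # class 2 is absent, as in the error above

print(missing_classes(train_y, val_y))  # [2]
```

The fit() signature shown in the traceback accepts train_index and val_index, so one possible remedy is to supply a split in which every class appears on both sides, for example one built with scikit-learn's train_test_split(..., stratify=y). This is a suggestion, not a confirmed fix.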

Why is the number of features calculated per process divided by 4?

The _calculate_and_evaluate function contains this line: length = int(np.ceil(len(candidate_features) / self.n_jobs / 4)).
But within each process, LightGBM scores a feature using only one core (n_jobs=1 in the _evaluate function). So does the divisor 4 have a specific meaning?
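
Note: a worked example of the quoted formula may help frame the question. The sketch uses math.ceil in place of np.ceil and a hypothetical helper name. The arithmetic shows that the divisor yields roughly four chunks per worker. Whether the authors chose it for load balancing is an assumption.

```python
import math


def chunk_plan(n_candidates, n_jobs):
    """Apply the quoted formula and report (chunk length, number of chunks)."""
    length = int(math.ceil(n_candidates / n_jobs / 4))
    n_chunks = math.ceil(n_candidates / length)
    return length, n_chunks


print(chunk_plan(1000, 4))  # (63, 16)
```

With 1000 candidates and 4 workers, each chunk holds 63 features and there are 16 chunks, so each worker processes about four chunks rather than one large one.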

AttributeError: 'tuple' object has no attribute 'tb_frame'

Hi,

I'm using OpenFE to extract features, but when I transform the data after extracting features, I encounter the following error:

AttributeError: 'tuple' object has no attribute 'tb_frame'

I'm running Python version 3.11 and have installed all the required dependencies. Despite this, I can't find a solution to the issue. How can I fix this problem?

Code snippet :

> ofe= OpenFE()
> candidate_features_list = get_candidate_features(numerical_features=list(test.columns))
> features = ofe.fit(data=train.drop('Target', axis=1), label=train['Target'],
>                     candidate_features_list=candidate_features_list, metric='multi_logloss', task='classification', stage2_params=params,
>                     min_candidate_features=5000,
>                     n_jobs=n_jobs, n_data_blocks=2, feature_boosting=True)
> 
> train_ft1, test_ft1 = transform(train.drop(target, axis=1), test, features, n_jobs=n_jobs)

How to turn off LightGBM's messages?

When I use the fit() function, it always prints too many messages from LightGBM, and it seems I can't turn them off by setting parameters. Is there any solution? Please help!

pyarrow.lib.ArrowInvalid: Field named proBNP is not found

When running train_x, test_x = transform(train_x, test_x, features, n_jobs=16), the following error occurs. The same error occurs in environments based on both Python 3.9 and Python 3.11. When n_jobs is changed, "pyarrow.lib.ArrowInvalid: Field named proBNP is not found" may change to "pyarrow.lib.ArrowInvalid: Field named NT is not found". It would be awesome if you could help solve this, thank you!

_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
File "D:\anaconda\envs\pytorch_gpu\lib\site-packages\openfe\utils.py", line 102, in _cal
_data = pd.read_feather('./openfe_tmp_data.feather', columns=base_features).set_index('openfe_index')
File "D:\anaconda\envs\pytorch_gpu\lib\site-packages\pandas\io\feather_format.py", line 126, in read_feather
return feather.read_feather(
File "D:\anaconda\envs\pytorch_gpu\lib\site-packages\pyarrow\feather.py", line 226, in read_feather
return (read_table(
File "D:\anaconda\envs\pytorch_gpu\lib\site-packages\pyarrow\feather.py", line 262, in read_table
table = reader.read_names(columns)
File "pyarrow_feather.pyx", line 114, in pyarrow._feather.FeatherReader.read_names
File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Field named proBNP is not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "D:\anaconda\envs\pytorch_gpu\lib\concurrent\futures\process.py", line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "D:\anaconda\envs\pytorch_gpu\lib\site-packages\openfe\utils.py", line 111, in _cal
exit()
File "D:\anaconda\envs\pytorch_gpu\lib_sitebuiltins.py", line 26, in call
raise SystemExit(code)
SystemExit: None
"""

The above exception was the direct cause of the following exception:

SystemExit Traceback (most recent call last)
[... skipping hidden 1 frame]

Cell In[26], line 1
----> 1 train_x, test_x = transform(train_x, test_x, features, n_jobs=4)

File D:\anaconda\envs\pytorch_gpu\lib\site-packages\openfe\utils.py:147, in transform(X_train, X_test, new_features_list, n_jobs, name)
146 for i, res in enumerate(results):
--> 147 is_cat, d1, d2, f = res.result()
148 names.append('autoFE_f_%d' % i + name)

File D:\anaconda\envs\pytorch_gpu\lib\concurrent\futures_base.py:439, in Future.result(self, timeout)
438 elif self._state == FINISHED:
--> 439 return self.__get_result()
441 self._condition.wait(timeout)

File D:\anaconda\envs\pytorch_gpu\lib\concurrent\futures_base.py:391, in Future.__get_result(self)
390 try:
--> 391 raise self._exception
392 finally:
393 # Break a reference cycle with the exception in self._exception

SystemExit: None

During handling of the above exception, another exception occurred:

AttributeError Traceback (most recent call last)
[... skipping hidden 1 frame]

File D:\anaconda\envs\pytorch_gpu\lib\site-packages\IPython\core\interactiveshell.py:2095, in InteractiveShell.showtraceback(self, exc_tuple, filename, tb_offset, exception_only, running_compiled_code)
2092 if exception_only:
2093 stb = ['An exception has occurred, use %tb to see '
2094 'the full traceback.\n']
-> 2095 stb.extend(self.InteractiveTB.get_exception_only(etype,
2096 value))
2097 else:
2098 try:
2099 # Exception classes can customise their traceback - we
2100 # use this in IPython.parallel for exceptions occurring
2101 # in the engines. This should return a list of strings.

File D:\anaconda\envs\pytorch_gpu\lib\site-packages\IPython\core\ultratb.py:696, in ListTB.get_exception_only(self, etype, value)
688 def get_exception_only(self, etype, value):
689 """Only print the exception type and message, without a traceback.
690
691 Parameters
(...)
694 value : exception value
695 """
--> 696 return ListTB.structured_traceback(self, etype, value)

File D:\anaconda\envs\pytorch_gpu\lib\site-packages\IPython\core\ultratb.py:559, in ListTB.structured_traceback(self, etype, evalue, etb, tb_offset, context)
556 chained_exc_ids.add(id(exception[1]))
557 chained_exceptions_tb_offset = 0
558 out_list = (
--> 559 self.structured_traceback(
560 etype,
561 evalue,
562 (etb, chained_exc_ids), # type: ignore
563 chained_exceptions_tb_offset,
564 context,
565 )
566 + chained_exception_message
567 + out_list)
569 return out_list

File D:\anaconda\envs\pytorch_gpu\lib\site-packages\IPython\core\ultratb.py:1396, in AutoFormattedTB.structured_traceback(self, etype, evalue, etb, tb_offset, number_of_lines_of_context)
1394 else:
1395 self.tb = etb
-> 1396 return FormattedTB.structured_traceback(
1397 self, etype, evalue, etb, tb_offset, number_of_lines_of_context
1398 )

File D:\anaconda\envs\pytorch_gpu\lib\site-packages\IPython\core\ultratb.py:1287, in FormattedTB.structured_traceback(self, etype, value, tb, tb_offset, number_of_lines_of_context)
1284 mode = self.mode
1285 if mode in self.verbose_modes:
1286 # Verbose modes need a full traceback
-> 1287 return VerboseTB.structured_traceback(
1288 self, etype, value, tb, tb_offset, number_of_lines_of_context
1289 )
1290 elif mode == 'Minimal':
1291 return ListTB.get_exception_only(self, etype, value)

File D:\anaconda\envs\pytorch_gpu\lib\site-packages\IPython\core\ultratb.py:1140, in VerboseTB.structured_traceback(self, etype, evalue, etb, tb_offset, number_of_lines_of_context)
1131 def structured_traceback(
1132 self,
1133 etype: type,
(...)
1137 number_of_lines_of_context: int = 5,
1138 ):
1139 """Return a nice text document describing the traceback."""
-> 1140 formatted_exception = self.format_exception_as_a_whole(etype, evalue, etb, number_of_lines_of_context,
1141 tb_offset)
1143 colors = self.Colors # just a shorthand + quicker name lookup
1144 colorsnormal = colors.Normal # used a lot

File D:\anaconda\envs\pytorch_gpu\lib\site-packages\IPython\core\ultratb.py:1030, in VerboseTB.format_exception_as_a_whole(self, etype, evalue, etb, number_of_lines_of_context, tb_offset)
1027 assert isinstance(tb_offset, int)
1028 head = self.prepare_header(str(etype), self.long_header)
1029 records = (
-> 1030 self.get_records(etb, number_of_lines_of_context, tb_offset) if etb else []
1031 )
1033 frames = []
1034 skipped = 0

File D:\anaconda\envs\pytorch_gpu\lib\site-packages\IPython\core\ultratb.py:1098, in VerboseTB.get_records(self, etb, number_of_lines_of_context, tb_offset)
1096 while cf is not None:
1097 try:
-> 1098 mod = inspect.getmodule(cf.tb_frame)
1099 if mod is not None:
1100 mod_name = mod.name

AttributeError: 'tuple' object has no attribute 'tb_frame'

Y label Error.

When I tried OpenFE, I got the error below.
ofe = openfe()
features = ofe.fit(data=X_train_feature, label=Y_train.ravel(), n_jobs=1)
D:\miniconda3\lib\site-packages\sklearn\preprocessing\_label.py:133: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
I guess the OpenFE source code does not handle label data of this shape.

pyarrow.lib.ArrowInvalid: Field named <column_name> is not found

The following code produced an error on the transform function. The fit function works correctly. This error is reproduced for every feature in the original dataset.

X = df.drop(["Target"], axis=1)
y = df["Target"]

ofe = OpenFE()
ofe.fit(data=X, label=y, categorical_features=cat_cols, n_jobs=11)

train_x, test_x = transform(X, test, ofe.new_features_list[:50], n_jobs=11 )
Traceback (most recent call last):
  File "/home/ben/projects/kaggle/.venv/lib/python3.10/site-packages/openfe/utils.py", line 102, in _cal
    _data = pd.read_feather('./openfe_tmp_data.feather', columns=base_features).set_index('openfe_index')
  File "/home/ben/projects/kaggle/.venv/lib/python3.10/site-packages/pandas/io/feather_format.py", line 124, in read_feather
    return feather.read_feather(
  File "/home/ben/projects/kaggle/.venv/lib/python3.10/site-packages/pyarrow/feather.py", line 226, in read_feather
    return (read_table(
  File "/home/ben/projects/kaggle/.venv/lib/python3.10/site-packages/pyarrow/feather.py", line 262, in read_table
    table = reader.read_names(columns)
  File "pyarrow/_feather.pyx", line 114, in pyarrow._feather.FeatherReader.read_names
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Field named evaluations is not found

How to reproduce table 3 (main results table)?

Thanks for sharing this interesting work!

I have 2 questions:

  1. Are there download links to the datasets used in Table 3? Specifically, I cannot find datasets such as Microsoft, Telecom, and Broken Machine online for download.
  2. Is there code associated with it to reproduce Table 3?


Thanks!

pyarrow.lib.ArrowInvalid: Field named openfe_index is not found

Hi, I have a problem when I try OpenFE on my own dataset.

Traceback (most recent call last):
  File "/mnt/lhc/anaconda3/lib/python3.8/site-packages/openfe/openfe.py", line 669, in _calculate_and_evaluate_multiprocess
    data = pd.read_feather(self.tmp_save_path, columns=list(base_features)).set_index('openfe_index')
  File "/mnt/lhc/anaconda3/lib/python3.8/site-packages/pandas/io/feather_format.py", line 127, in read_feather
    return feather.read_feather(
  File "/mnt/lhc/anaconda3/lib/python3.8/site-packages/pyarrow/feather.py", line 226, in read_feather
    return (read_table(
  File "/mnt/lhc/anaconda3/lib/python3.8/site-packages/pyarrow/feather.py", line 262, in read_table
    table = reader.read_names(columns)
  File "pyarrow/_feather.pyx", line 114, in pyarrow._feather.FeatherReader.read_names
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Field named openfe_index is not found

It looks like the function _calculate_and_evaluate_multiprocess is trying to read train_x at
data = pd.read_feather(self.tmp_save_path, columns=list(base_features)).set_index('openfe_index')
while train_x does not have a column named 'openfe_index', which is added to the list variable base_features.

I am not clear why the extra column name 'openfe_index' is added. When I run the example code, there is no such problem. I have checked my input data: all inputs are pandas DataFrames and have no column type problems.

math expressions of features

I'd like to use the newly generated features for further clustering and for model interpretability, so I need their math expressions. Is there a simple way to fetch the math expressions of the new features? Thanks.

(Model interpretability is important. People won't use features they cannot explain.)
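
Note: a traceback elsewhere on this page shows a formula_to_tree function in openfe/utils.py, which suggests that generated features are stored as operator trees that correspond to formula strings. The sketch below illustrates the general idea of rendering such a tree as a readable expression. The Leaf and Op classes and the to_formula function are hypothetical stand-ins, not OpenFE's actual classes or API.

```python
class Leaf:
    """A base column, identified by its name."""
    def __init__(self, name):
        self.name = name


class Op:
    """An operator applied to one or more child nodes."""
    def __init__(self, name, children):
        self.name = name
        self.children = children


INFIX = {"+", "-", "*", "/"}


def to_formula(node):
    """Render a tree as a readable expression string."""
    if isinstance(node, Leaf):
        return node.name
    parts = [to_formula(c) for c in node.children]
    if node.name in INFIX and len(parts) == 2:
        return f"({parts[0]}{node.name}{parts[1]})"
    return f"{node.name}({','.join(parts)})"


feature = Op("+", [Op("sqrt", [Leaf("a")]), Op("sqrt", [Leaf("b")])])
print(to_formula(feature))  # (sqrt(a)+sqrt(b))
```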

multi-columns in category-num operators

When I use an operator like GroupByThenXXX, d1 is the numerical data and d2 is the categorical data.
If d2 contains one feature, for example groupby('date'), it works.
However, if d2 contains multiple features, like groupby(['date','class']), it doesn't seem to work.
I've read the code in FNode: the calculate function returns data[self.name], so name seems to support only one column.
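
Note: if the operator indeed accepts only a single grouping column, one user-side workaround is to build a composite key column before calling fit(), so that a single-column group-by sees each (date, class) pair as one category. The sketch below is an illustration with a hypothetical helper. It is not a confirmed OpenFE feature.

```python
def combine_keys(*columns, sep="|"):
    """Join several categorical columns row-wise into one composite key column."""
    return [sep.join(str(v) for v in row) for row in zip(*columns)]


date = ["2023-01-01", "2023-01-01", "2023-01-02"]
cls = ["x", "y", "x"]

date_class = combine_keys(date, cls)
print(date_class)  # ['2023-01-01|x', '2023-01-01|y', '2023-01-02|x']
```

The resulting list can be added to the data as a new categorical column. The separator should be a character that does not occur in the original values, so that distinct pairs cannot collide.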

is it normal to be so slow?

I have been waiting for around ten minutes and there is no response. Is that normal?
In general, how long should we wait?
My X.shape is (49810800, 5).
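
Note: with about 49.8 million rows, every candidate feature has to be computed and scored on a large amount of data. One option, stated here as an assumption rather than the authors' advice, is to run the feature search on a random subset of rows and apply the selected features to the full data afterwards. The sketch below draws reproducible row positions with the standard library. The helper name and the sample size of 100,000 are arbitrary.

```python
import random


def sample_row_indices(n_rows, n_sample, seed=0):
    """Pick `n_sample` distinct row positions uniformly at random, sorted."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_rows), min(n_sample, n_rows)))


idx = sample_row_indices(n_rows=49_810_800, n_sample=100_000)
print(len(idx))  # 100000
```

With pandas, the subset can then be taken with X.iloc[idx] and the matching label rows.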

Can I use other models

Hello, I have two questions and hope you can help answer them.
1. In stage 1 and stage 2 selection, can I use other models, such as XGBoost, random forest, or AdaBoost?
2. My base model is LR. Can I use OpenFE to generate features for it?

OpenFE stuck at stage I selection

I've just installed the package, but when I try the California housing sample code, the process gets stuck at stage I.
It has been like this for quite a long time, which seems suspicious.

Here are my package versions:
python 3.10.9
numpy 1.23.5
pandas 1.5.3
scikit-learn 1.2.1
lightgbm 4.1.0
scipy 1.10.0
xgboost 2.0.0
tqdm 4.64.1
pyarrow 13.0.0

california housing example

I ran the 'california housing' sample code and got the following error:

Traceback (most recent call last):
File "/Users/quanminghang/opt/anaconda3/envs/SQquant/lib/python3.8/site-packages/openfe/openfe.py", line 610, in _evaluate
gbm.fit(train_x, train_y, init_score=train_init,
File "/Users/quanminghang/opt/anaconda3/envs/SQquant/lib/python3.8/site-packages/lightgbm/sklearn.py", line 818, in fit
super().fit(X, y, sample_weight=sample_weight, init_score=init_score,
File "/Users/quanminghang/opt/anaconda3/envs/SQquant/lib/python3.8/site-packages/lightgbm/sklearn.py", line 683, in fit
self._Booster = train(params, train_set,
File "/Users/quanminghang/opt/anaconda3/envs/SQquant/lib/python3.8/site-packages/lightgbm/engine.py", line 228, in train
booster = Booster(params=params, train_set=train_set)
File "/Users/quanminghang/opt/anaconda3/envs/SQquant/lib/python3.8/site-packages/lightgbm/basic.py", line 2229, in init
train_set.construct()
File "/Users/quanminghang/opt/anaconda3/envs/SQquant/lib/python3.8/site-packages/lightgbm/basic.py", line 1468, in construct
self._lazy_init(self.data, label=self.label,
File "/Users/quanminghang/opt/anaconda3/envs/SQquant/lib/python3.8/site-packages/lightgbm/basic.py", line 1294, in _lazy_init
self.set_init_score(init_score)
File "/Users/quanminghang/opt/anaconda3/envs/SQquant/lib/python3.8/site-packages/lightgbm/basic.py", line 1843, in set_init_score
init_score = list_to_1d_numpy(init_score, np.float64, name='init_score')
File "/Users/quanminghang/opt/anaconda3/envs/SQquant/lib/python3.8/site-packages/lightgbm/basic.py", line 164, in list_to_1d_numpy
raise TypeError("Wrong type({0}) for {1}.\n"
TypeError: Wrong type(DataFrame) for init_score.
It should be list, numpy 1-D array or pandas Series

Running the code in my own dataset with error

Hi, today I tried to run the code on my own dataset. The data has 31 features, all numeric.
The task is regression.
When I run the code, I get the following error:
The number of candidate features is 9329
6%|▋ | 2/32 [00:01<00:20, 1.46it/s]
Output exceeds the size limit. Open the full output data in a text editor

_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
File "c:\Users\Zhong\AppData\Local\Programs\Python\Python310\lib\site-packages\openfe\openfe.py", line 662, in _calculate_and_evaluate_multiprocess
score = self._evaluate(candidate_feature, train_y, val_y, train_init, val_init, init_metric)
File "c:\Users\Zhong\AppData\Local\Programs\Python\Python310\lib\site-packages\openfe\openfe.py", line 595, in _evaluate
return score
UnboundLocalError: local variable 'score' referenced before assignment

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "c:\Users\Zhong\AppData\Local\Programs\Python\Python310\lib\concurrent\futures\process.py", line 243, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "c:\Users\Zhong\AppData\Local\Programs\Python\Python310\lib\site-packages\openfe\openfe.py", line 667, in _calculate_and_evaluate_multiprocess
exit()
File "c:\Users\Zhong\AppData\Local\Programs\Python\Python310\lib_sitebuiltins.py", line 26, in call
raise SystemExit(code)
SystemExit: None
"""

The above exception was the direct cause of the following exception:
...
170 if isinstance(error, str):
171 error = AssertionError(error)
--> 172 raise error

AssertionError:

Exception when `-` in column name during transform

The following example on AdultIncome dataset fails due to - in feature names (for example, capital-gain).

import pandas as pd
from openfe import OpenFE, transform


if __name__ == '__main__':
    label = 'class'
    train_data = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
    test_data = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')

    train_y = train_data[[label]]
    train_x = train_data.drop(columns=[label])
    test_y = test_data[[label]]
    test_x = test_data.drop(columns=[label])

    train_x = train_x[['age', 'sex', 'capital-gain']]
    test_x = test_x[['age', 'sex', 'capital-gain']]

    # FIXME: Uncomment to avoid exception
    # train_x.columns = ['age', 'sex', 'capital']
    # test_x.columns = ['age', 'sex', 'capital']

    ofe = OpenFE()
    features = ofe.fit(data=train_x, label=train_y, n_jobs=4)
    train_x, test_x = transform(X_train=train_x, X_test=test_x, new_features_list=features, n_jobs=4)

The error looks like this:

Traceback (most recent call last):
  File "/home/ubuntu/.conda/envs/code/lib/python3.8/site-packages/openfe/utils.py", line 102, in _cal
    _data = pd.read_feather('./openfe_tmp_data.feather', columns=base_features).set_index('openfe_index')
  File "/home/ubuntu/.conda/envs/code/lib/python3.8/site-packages/pandas/io/feather_format.py", line 132, in read_feather
    return feather.read_feather(
  File "/home/ubuntu/.conda/envs/code/lib/python3.8/site-packages/pyarrow/feather.py", line 226, in read_feather
    return (read_table(
  File "/home/ubuntu/.conda/envs/code/lib/python3.8/site-packages/pyarrow/feather.py", line 262, in read_table
    table = reader.read_names(columns)
  File "pyarrow/_feather.pyx", line 114, in pyarrow._feather.FeatherReader.read_names
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Field named gain is not found

Traceback (most recent call last):
  File "/home/ubuntu/.conda/envs/code/lib/python3.8/site-packages/openfe/utils.py", line 102, in _cal
    _data = pd.read_feather('./openfe_tmp_data.feather', columns=base_features).set_index('openfe_index')
  File "/home/ubuntu/.conda/envs/code/lib/python3.8/site-packages/pandas/io/feather_format.py", line 132, in read_feather
    return feather.read_feather(
  File "/home/ubuntu/.conda/envs/code/lib/python3.8/site-packages/pyarrow/feather.py", line 226, in read_feather
    return (read_table(
  File "/home/ubuntu/.conda/envs/code/lib/python3.8/site-packages/pyarrow/feather.py", line 262, in read_table
    table = reader.read_names(columns)
  File "pyarrow/_feather.pyx", line 114, in pyarrow._feather.FeatherReader.read_names
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Field named capital is not found

Proposed Fix:

Either add support for feature names containing `-`, or sanitize the column names internally to avoid the error.

Also be wary of the edge case with three feature names: A, B, and A-B.
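Until something like this is supported upstream, a user-side workaround is to rename the columns before calling OpenFE. The following is a minimal sketch, not part of OpenFE: `sanitize_columns` is a hypothetical helper, and it assumes that names made only of letters, digits, and underscores are safe for OpenFE. It keeps a mapping so the original names can be restored, and it avoids collisions such as `capital-gain` versus an existing `capital_gain`.

```python
import re

import pandas as pd


def sanitize_columns(df):
    """Return a renamed copy of df plus the original-to-safe name mapping.

    Hypothetical helper: every non-word character is replaced by "_",
    and a numeric suffix is appended when two names would collide.
    """
    mapping = {}
    used = set()
    for col in df.columns:
        safe = re.sub(r"\W", "_", str(col))
        candidate = safe
        n = 1
        while candidate in used:
            candidate = f"{safe}_{n}"
            n += 1
        used.add(candidate)
        mapping[col] = candidate
    return df.rename(columns=mapping), mapping


df = pd.DataFrame({"age": [39], "capital-gain": [2174], "capital_gain": [0]})
safe_df, mapping = sanitize_columns(df)
print(list(safe_df.columns))

# Invert the mapping to recover the original names afterwards.
restore = {safe: original for original, safe in mapping.items()}
```

The same mapping should be applied to both the training and the test frame so that their columns stay aligned.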

explanation of generated features?

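Hello authors! Thank you for providing such a useful tool.
I saw some of the feature generation operators in the source code.
Could you consider adding a function that gives a mathematical explanation of each generated feature, for example that feature X is the square root of a plus the square root of b? The goal is for every feature to come with an explanation.
When I write application-oriented papers, the interpretability of these high-level features could be the novel contribution of the paper. Thanks!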

Bug report: subsampled train_index and test_index can have duplicated values

Hi, Tianpin. During a recent test on a classification problem with a relatively small dataset, an error was raised in the stage_1_select process.

Traceback (most recent call last):
  File "/mnt/lhc/anaconda3/lib/python3.8/site-packages/openfe/openfe.py", line 589, in _evaluate
    gbm.fit(train_x, train_y, init_score=train_init,
  File "/mnt/lhc/anaconda3/lib/python3.8/site-packages/lightgbm/sklearn.py", line 967, in fit
    super().fit(X, _y, sample_weight=sample_weight, init_score=init_score, eval_set=valid_sets,
  File "/mnt/lhc/anaconda3/lib/python3.8/site-packages/lightgbm/sklearn.py", line 748, in fit
    self._Booster = train(
  File "/mnt/lhc/anaconda3/lib/python3.8/site-packages/lightgbm/engine.py", line 271, in train
    booster = Booster(params=params, train_set=train_set)
  File "/mnt/lhc/anaconda3/lib/python3.8/site-packages/lightgbm/basic.py", line 2605, in __init__
    train_set.construct()
  File "/mnt/lhc/anaconda3/lib/python3.8/site-packages/lightgbm/basic.py", line 1815, in construct
    self._lazy_init(self.data, label=self.label,
  File "/mnt/lhc/anaconda3/lib/python3.8/site-packages/lightgbm/basic.py", line 1557, in _lazy_init
    self.set_label(label)
  File "/mnt/lhc/anaconda3/lib/python3.8/site-packages/lightgbm/basic.py", line 2164, in set_label
    self.set_field('label', label)
  File "/mnt/lhc/anaconda3/lib/python3.8/site-packages/lightgbm/basic.py", line 1993, in set_field
    _safe_call(_LIB.LGBM_DatasetSetField(
  File "/mnt/lhc/anaconda3/lib/python3.8/site-packages/lightgbm/basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Length of label is not same with #data

This is caused by the sampling function _subsample, which can generate subsamples with duplicated indices. Specifically, at lines 458 to 461:

        train_index_samples = _subsample(self.train_index, self.n_data_blocks)
        val_index_samples = _subsample(self.val_index, self.n_data_blocks)
        idx = 0
        train_idx = train_index_samples[idx]
        val_idx = val_index_samples[idx]

train_idx and val_idx may contain some of the same index values, which can cause an error.

In _calculate_and_evaluate_multiprocess, at line 670, the dataframe is sliced like this:

        data_temp = data.loc[train_idx + val_idx]

data_temp will then have a duplicated index. As a result, train_x = pd.DataFrame(candidate_feature.data.loc[train_y.index]) also contains duplicated index entries, which leads to lightgbm.basic.LightGBMError: Length of label is not same with #data
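The length mismatch can be reproduced outside OpenFE with a few lines of pandas. This is a minimal sketch rather than OpenFE's own code: the toy frame, the overlapping index lists, and the deduplication step at the end are all illustrative assumptions.

```python
import pandas as pd

# Toy data standing in for OpenFE's internal frames (illustrative only).
data = pd.DataFrame({"f": [10, 20, 30, 40]}, index=[0, 1, 2, 3])
label = pd.Series([0, 1, 0, 1], index=[0, 1, 2, 3])

train_idx = [0, 1, 2]
val_idx = [2, 3]  # label 2 appears in both samples

# Concatenating the lists keeps the repeated label, so .loc returns it twice.
data_temp = data.loc[train_idx + val_idx]
train_y = label.loc[train_idx]

# Selecting by train_y.index on a non-unique index returns every match,
# so train_x ends up with more rows than train_y.
train_x = data_temp.loc[train_y.index]
print(len(train_x), len(train_y))

# One possible workaround (an assumption, not the upstream fix):
# drop repeated labels before slicing.
unique_idx = pd.Index(train_idx + val_idx).unique()
data_dedup = data.loc[unique_idx]
train_x_fixed = data_dedup.loc[train_y.index]
```

Deduplicating only hides the symptom. Making the train and validation subsamples disjoint in the first place would also avoid validating on training rows.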

Y label issue

When I run a classification task on a small-scale classification dataset (normally within 1M), I get the error mentioned in PR #4, even though I have manually checked that the data and the label share the same index and have the right data types. The error does not occur in regression tasks or on larger classification datasets. Is anyone else still seeing a similar issue?

Error when running

raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

25%|█████████████████████ | 1/4 [00:04<00:14, 4.81s/it]
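The message above describes the standard main-module guard for multiprocessing. Below is a minimal standard-library sketch of that idiom. It assumes the error comes from starting worker processes at the top level of a script on a platform that spawns child processes rather than forking them (such as Windows). The names `square` and `main` are placeholders, not OpenFE functions.

```python
from concurrent.futures import ProcessPoolExecutor


def square(x):
    """Worker function; defined at module level so child processes can import it."""
    return x * x


def main():
    # Worker processes are started only from here. A spawned child
    # re-imports this file, but the guard below keeps it from calling
    # main() again and launching further processes.
    with ProcessPoolExecutor(max_workers=2) as pool:
        return list(pool.map(square, range(5)))


if __name__ == "__main__":
    print(main())
```

With OpenFE, the equivalent would be to place the `ofe.fit(...)` and `transform(...)` calls that use `n_jobs` inside such a guard, as the AdultIncome example earlier on this page does.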
