
verstack's People

Contributors

danilzherebtsov, h-schot


verstack's Issues

could not convert string to float: 'x' - using FeatureSelector

When trying to use FeatureSelector, I got the "could not convert string to float: 'x'" message.

The command I used (Python 3.10):

from verstack import FeatureSelector
FS = FeatureSelector(objective = 'classification', auto = True)
selected_feats = FS.fit_transform(X_encoded, y)

Error call stack:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[24], line 3
      1 from verstack import FeatureSelector
      2 FS = FeatureSelector(objective = 'classification', auto = True)
----> 3 selected_feats = FS.fit_transform(X_encoded, y)

File ~\AppData\Roaming\Python\Python310\site-packages\verstack\tools.py:19, in timer.<locals>.wrapped(*args, **kwargs)
     16 @wraps(func)
     17 def wrapped(*args, **kwargs):
     18     start = time.time()
---> 19     result = func(*args, **kwargs)
     20     end = time.time()
     21     elapsed = round(end-start,5)

File ~\AppData\Roaming\Python\Python310\site-packages\verstack\FeatureSelector.py:232, in FeatureSelector.fit_transform(self, X, y, **kwargs)
    230 if self.auto:
    231     self.printer.print(f'Comparing LinearRegression and RandomForest for feature selection', order = 2)
--> 232     self._auto_linear_randomforest_selector(X, y, kwargs)
    233 else:
    234     self.printer.print(f'Running feature selection with {self._model}', order = 2)

File ~\AppData\Roaming\Python\Python310\site-packages\verstack\FeatureSelector.py:294, in FeatureSelector._auto_linear_randomforest_selector(self, X, y, kwargs)
    291 selector_rf = self._get_selector(randomforest_model, y, kwargs)
    293 self.printer.print(f'Running feature selection with {linear_model}', order = 2)
--> 294 feats_lr_flags = self._prepare_data_apply_selector(X, y, selector_lr, scale_data = True)
    296 self.printer.print(f'Running feature selection with {randomforest_model}', order = 2)
    297 feats_rf_flags = self._prepare_data_apply_selector(X, y, selector_rf, scale_data = False)

File ~\AppData\Roaming\Python\Python310\site-packages\verstack\FeatureSelector.py:251, in FeatureSelector._prepare_data_apply_selector(self, X, y, selector, scale_data)
    249 X_subset, y_subset = self._subset_data(X, y)
    250 if scale_data:
--> 251     X_subset = self._scale_data(X_subset)
    252 try:
    253     X_subset, y_subset = self._transform_data_to_float_32(X_subset, y_subset)

File ~\AppData\Roaming\Python\Python310\site-packages\verstack\FeatureSelector.py:499, in FeatureSelector._scale_data(self, X)
    497 from sklearn.preprocessing import StandardScaler
    498 scaler = StandardScaler()
--> 499 X = scaler.fit_transform(X)
    500 return X

File C:\Anaconda3\envs\python_310\lib\site-packages\sklearn\base.py:867, in TransformerMixin.fit_transform(self, X, y, **fit_params)
    863 # non-optimized default implementation; override when a better
    864 # method is possible for a given clustering algorithm
    865 if y is None:
    866     # fit method of arity 1 (unsupervised transformation)
--> 867     return self.fit(X, **fit_params).transform(X)
    868 else:
    869     # fit method of arity 2 (supervised transformation)
    870     return self.fit(X, y, **fit_params).transform(X)

File C:\Anaconda3\envs\python_310\lib\site-packages\sklearn\preprocessing\_data.py:809, in StandardScaler.fit(self, X, y, sample_weight)
    807 # Reset internal state before fitting
    808 self._reset()
--> 809 return self.partial_fit(X, y, sample_weight)

File C:\Anaconda3\envs\python_310\lib\site-packages\sklearn\preprocessing\_data.py:844, in StandardScaler.partial_fit(self, X, y, sample_weight)
    812 """Online computation of mean and std on X for later scaling.
    813 
    814 All of X is processed as a single batch. This is intended for cases
   (...)
    841     Fitted scaler.
    842 """
    843 first_call = not hasattr(self, "n_samples_seen_")
--> 844 X = self._validate_data(
    845     X,
    846     accept_sparse=("csr", "csc"),
    847     dtype=FLOAT_DTYPES,
    848     force_all_finite="allow-nan",
    849     reset=first_call,
    850 )
    851 n_features = X.shape[1]
    853 if sample_weight is not None:

File C:\Anaconda3\envs\python_310\lib\site-packages\sklearn\base.py:577, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
    575     raise ValueError("Validation should be done on X, y or both.")
    576 elif not no_val_X and no_val_y:
--> 577     X = check_array(X, input_name="X", **check_params)
    578     out = X
    579 elif no_val_X and not no_val_y:

File C:\Anaconda3\envs\python_310\lib\site-packages\sklearn\utils\validation.py:856, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    854         array = array.astype(dtype, casting="unsafe", copy=False)
    855     else:
--> 856         array = np.asarray(array, order=order, dtype=dtype)
    857 except ComplexWarning as complex_warning:
    858     raise ValueError(
    859         "Complex data not supported\n{}\n".format(array)
    860     ) from complex_warning

File C:\Anaconda3\envs\python_310\lib\site-packages\pandas\core\generic.py:2070, in NDFrame.__array__(self, dtype)
   2069 def __array__(self, dtype: npt.DTypeLike | None = None) -> np.ndarray:
-> 2070     return np.asarray(self._values, dtype=dtype)

ValueError: could not convert string to float: 'x'

Could you help me understand what I'm doing wrong?

Thanks,
balgad

ps.: anyway, it's a great package! :)
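
For reference, the ValueError means X_encoded still contains at least one string value when FeatureSelector's internal StandardScaler runs. A minimal diagnostic sketch (pd.factorize is just one of several encoding options):

import pandas as pd

# StandardScaler (used internally when FeatureSelector scales data)
# requires all-numeric input, so list the columns that are still strings.
non_numeric = X_encoded.select_dtypes(exclude='number').columns
print(non_numeric.tolist())

# One possible remedy: encode the remaining object columns
# before calling FS.fit_transform.
for col in non_numeric:
    X_encoded[col] = pd.factorize(X_encoded[col])[0]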

LGBM Tuner - Persistent studies

Would it be possible to make studies persistent (e.g. through the 'storage' parameter of optuna.create_study)?
Thank you in advance
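
For reference, this is how plain Optuna persists a study (standard Optuna API, shown only as context for the request, not as an existing LGBMTuner feature):

import optuna

# A study backed by SQLite survives process restarts;
# load_if_exists resumes it instead of failing on a name clash.
study = optuna.create_study(
    study_name='lgbm_tuning',
    storage='sqlite:///lgbm_tuning.db',
    load_if_exists=True,
    direction='minimize')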

A train/test/val split of 60/20/20 doesn't work with scsplit, while it does using train_test_split from sklearn

x_train, x_test, y_train, y_test = scsplit(x, y, stratify = y, test_size=0.4, random_state=42)
x_test, x_val, y_test, y_val = scsplit(x_test, y_test, stratify = y_test, test_size=0.5, random_state=42) 

^ doesn't work

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=42)
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5, random_state=42) 

^ works

Traceback (most recent call last):
  File "train_test_val.py", line 26, in <module>
    x_train, x_test, y_train, y_test = scsplit(x, y, stratify = y, test_size=0.4, random_state=42)
  File "/SeaExp/mona/venv/dpcc/lib/python3.8/site-packages/verstack/stratified_continuous_split.py", line 153, in scsplit
    X_t, X_v, y_t, y_v = split(X, y_binned,
  File "/SeaExp/mona/venv/dpcc/lib/python3.8/site-packages/sklearn/model_selection/_split.py", line 2422, in train_test_split
    n_train, n_test = _validate_shuffle_split(
  File "/SeaExp/mona/venv/dpcc/lib/python3.8/site-packages/sklearn/model_selection/_split.py", line 2067, in _validate_shuffle_split
    raise ValueError(
ValueError: The sum of test_size and train_size = 1.1, should be in the (0, 1) range. Reduce test_size and/or train_size.
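
The 1.1 in the error is consistent with scsplit's train_size defaulting to 0.7 while only test_size=0.4 was passed. If that reading is right, setting train_size explicitly may be a workaround (a sketch under that assumption):

from verstack.stratified_continuous_split import scsplit

# Workaround sketch: pass train_size explicitly so the two
# fractions sum to 1.0 instead of 0.7 (default) + 0.4 = 1.1.
x_train, x_test, y_train, y_test = scsplit(
    x, y, stratify=y, train_size=0.6, test_size=0.4, random_state=42)
x_test, x_val, y_test, y_val = scsplit(
    x_test, y_test, stratify=y_test, train_size=0.5, test_size=0.5, random_state=42)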

Updating requirements

Is it possible to get the requirements.txt updated to newer versions of scikit-learn, lightgbm, plotly, and python-dateutil?

I have used the FeatureSelector with manually updated versions of the above packages, and it worked without any hiccup. Other DS packages usually require comparatively newer versions, and installing verstack always gives a dependency error.

Can't install on Python 3.9 on an M1 Mac using Poetry

I am trying to install on an M1 Mac under Python 3.9.13 using the Poetry package manager.
Installing works when I use pip install verstack, but Poetry is a very popular option nowadays, so ideally the package should install through it as well.

I have a blank virtual environment:

[tool.poetry]
name = "trying-out-verstack"
version = "0.0.1"
description = "Trying out the Verstack library, a library for helping data scientists with common tasks."
authors = ["Some name <[email protected]>"]
readme = "README.md"

[tool.poetry.dependencies]
python = ">=3.9.12,<3.13"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

I add verstack using the command poetry add verstack.

This is the error:

[a lot of lines left out]
  no previously-included directories found matching 'benchmarks/numpy'
  warning: no previously-included files matching '*.pyo' found anywhere in distribution
  warning: no previously-included files matching '*.pyd' found anywhere in distribution
  warning: no previously-included files matching '*.swp' found anywhere in distribution
  warning: no previously-included files matching '*.bak' found anywhere in distribution
  warning: no previously-included files matching '*~' found anywhere in distribution
  warning: no previously-included files found matching 'LICENSES_bundled.txt'
  writing manifest file 'numpy.egg-info/SOURCES.txt'
  Copying numpy.egg-info to build/bdist.macosx-13.5-arm64/wheel/numpy-1.19.5-py3.9.egg-info
  running install_scripts
  Traceback (most recent call last):
    File "/Users/allanlrh/Library/Application Support/pypoetry/venv/lib/python3.9/site-packages/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
      main()
    File "/Users/allanlrh/Library/Application Support/pypoetry/venv/lib/python3.9/site-packages/pyproject_hooks/_in_process/_in_process.py", line 335, in main
      json_out['return_val'] = hook(**hook_input['kwargs'])
    File "/Users/allanlrh/Library/Application Support/pypoetry/venv/lib/python3.9/site-packages/pyproject_hooks/_in_process/_in_process.py", line 251, in build_wheel
      return _build_backend().build_wheel(wheel_directory, config_settings,
    File "/private/var/folders/y2/cb1g865x0rz5yzq5vtqthrgh0000gn/T/tmpt1z3iq_i/.venv/lib/python3.9/site-packages/setuptools/build_meta.py", line 211, in build_wheel
      return self._build_with_temp_dir(['bdist_wheel'], '.whl',
    File "/private/var/folders/y2/cb1g865x0rz5yzq5vtqthrgh0000gn/T/tmpt1z3iq_i/.venv/lib/python3.9/site-packages/setuptools/build_meta.py", line 197, in _build_with_temp_dir
      self.run_setup()
    File "/private/var/folders/y2/cb1g865x0rz5yzq5vtqthrgh0000gn/T/tmpt1z3iq_i/.venv/lib/python3.9/site-packages/setuptools/build_meta.py", line 248, in run_setup
      super(_BuildMetaLegacyBackend,
    File "/private/var/folders/y2/cb1g865x0rz5yzq5vtqthrgh0000gn/T/tmpt1z3iq_i/.venv/lib/python3.9/site-packages/setuptools/build_meta.py", line 142, in run_setup
      exec(compile(code, __file__, 'exec'), locals())
    File "setup.py", line 508, in <module>
      setup_package()
    File "setup.py", line 500, in setup_package
      setup(**metadata)
    File "/private/var/folders/y2/cb1g865x0rz5yzq5vtqthrgh0000gn/T/tmpphzcec02/numpy-1.19.5/numpy/distutils/core.py", line 169, in setup
      return old_setup(**new_attr)
    File "/private/var/folders/y2/cb1g865x0rz5yzq5vtqthrgh0000gn/T/tmpt1z3iq_i/.venv/lib/python3.9/site-packages/setuptools/__init__.py", line 165, in setup
      return distutils.core.setup(**attrs)
    File "/Users/allanlrh/.pyenv/versions/3.9.13/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/distutils/core.py", line 148, in setup
      dist.run_commands()
    File "/Users/allanlrh/.pyenv/versions/3.9.13/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/distutils/dist.py", line 966, in run_commands
      self.run_command(cmd)
    File "/Users/allanlrh/.pyenv/versions/3.9.13/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "/private/var/folders/y2/cb1g865x0rz5yzq5vtqthrgh0000gn/T/tmpt1z3iq_i/.venv/lib/python3.9/site-packages/wheel/bdist_wheel.py", line 328, in run
      impl_tag, abi_tag, plat_tag = self.get_tag()
    File "/private/var/folders/y2/cb1g865x0rz5yzq5vtqthrgh0000gn/T/tmpt1z3iq_i/.venv/lib/python3.9/site-packages/wheel/bdist_wheel.py", line 278, in get_tag
      assert tag in supported_tags, "would build wheel with unsupported tag {}".format(tag)
  AssertionError: would build wheel with unsupported tag ('cp39', 'cp39', 'macosx_13_5_arm64')
  

  at ~/Library/Application Support/pypoetry/venv/lib/python3.9/site-packages/poetry/installation/chef.py:166 in _prepare
      162│ 
      163│                 error = ChefBuildError("\n\n".join(message_parts))
      164│ 
      165│             if error is not None:
    → 166│                 raise error from None
      167│ 
      168│             return path
      169│ 
      170│     def _prepare_sdist(self, archive: Path, destination: Path | None = None) -> Path:

Note: This error originates from the build backend, and is likely not a problem with poetry but with numpy (1.19.5) not supporting PEP 517 builds. You can verify this by running 'pip wheel --no-cache-dir --use-pep517 "numpy (==1.19.5)"'.

How to forward some fixed arguments to LGBMClassifier?

I'm using verstack.LGBMTuner to optimize the params of a lightgbm.LGBMClassifier that should solve a multiclass classification problem.

My trouble is that the dataset is heavily imbalanced, so I need to pass the argument class_weight='balanced' to LGBMClassifier's constructor.

I read the docs of LGBMTuner, but I didn't find any hint about how to forward parameters to the underlying estimator.

Is there a way I can do it?
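
One possible workaround, assuming the tuned hyperparameters are what's mainly needed: let LGBMTuner optimize, then refit the final estimator manually with the extra fixed argument (tuner.best_params is assumed here, per the verstack docs; other names are from the question):

from lightgbm import LGBMClassifier

# Sketch: reuse the tuned hyperparameters and add the fixed kwarg.
params = dict(tuner.best_params)
params.pop('class_weight', None)  # avoid a duplicate kwarg, just in case
model = LGBMClassifier(**params, class_weight='balanced')
model.fit(X_train, y_train)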

Failure at verstack installation: Could not build wheels for lightgbm, which is required to install pyproject.toml-based projects

error: subprocess-exited-with-error

× Building wheel for lightgbm (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [49 lines of output]
...
*** CMake build failed
[end of output]
....
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for lightgbm

I followed your installation guide and tried to install the dependencies one by one with pip, still no luck.

5-fold cross validation using verstack scsplit

Hi Danil,

Thanks a lot for your great package. I used this code:

x_train, x_val, y_train, y_val = scsplit(x, y, stratify = y, test_size=0.3, random_state=42)
train = [x_train, y_train] 

However, I am interested in doing a 5-fold cross validation, and I was unsure how to use scsplit along with a 5-fold CV method. Could you please share a code snippet that showcases this?
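
scsplit itself produces a single split; for 5-fold CV with stratification on a continuous target, one common approach is to bin y and use scikit-learn's StratifiedKFold, mirroring the binning scsplit does internally (a sketch, not verstack API):

import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Bin the continuous target so StratifiedKFold can stratify on it.
y_binned = pd.qcut(y, q=10, labels=False, duplicates='drop')

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_ix, val_ix in skf.split(x, y_binned):
    x_train, x_val = x.iloc[train_ix], x.iloc[val_ix]
    y_train, y_val = y.iloc[train_ix], y.iloc[val_ix]
    # fit and evaluate a model on each fold here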

LGBMTuner GPU support

When training a model in vanilla LightGBM you can specify the target device to be GPU:

model = LGBMClassifier(
    device="gpu",
)

However, when running the model with LGBMTuner I do not see such an option. Upon launching fit(), CPU utilization spikes to 100%. Is there any way to make the program utilize the GPU? Could you document it?

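One possible workaround, assuming the tuner exposes no device option: tune on CPU, then train the final model on GPU with the tuned parameters (tuner.best_params assumed; requires a GPU-enabled LightGBM build):

from lightgbm import LGBMClassifier

# Sketch: reuse the tuned hyperparameters for a GPU-trained final model.
model = LGBMClassifier(**tuner.best_params, device='gpu')
model.fit(X_train, y_train)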

Customizing search space for LGBMTuner

The default search space seems limited for several use cases. I've trained models before in which Optuna yielded num_leaves = 1700, yet verstack doesn't seem to go above 300 for the same data.

Is it possible to customize the search space?
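
If LGBMTuner turns out not to expose this, the wider range can be searched with plain Optuna (standard Optuna/LightGBM API; the ranges below are illustrative):

import lightgbm as lgb
import optuna
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        # wider num_leaves range than the tuner's apparent cap
        'num_leaves': trial.suggest_int('num_leaves', 16, 2048, log=True),
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
    }
    model = lgb.LGBMClassifier(**params)
    return cross_val_score(model, X, y, cv=3).mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)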

UFuncTypeError

I am getting the following error on using:

X_train, X_test, y_train, y_test = scsplit(X, y, stratify=y)

Inputs are all pandas dataframes.
UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U32'), dtype('<U32')) -> dtype('<U32')
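
The '<U32' dtypes indicate that some input is being handled as strings. A quick diagnostic sketch (the conversion at the end is only a guess at the cause):

import pandas as pd

# '<U32' is a numpy unicode dtype, i.e. something is stored as strings.
print(X.dtypes[X.dtypes == object])

# If y arrived as a one-column DataFrame of strings, convert it
# to a numeric Series before calling scsplit.
y = pd.to_numeric(y.squeeze())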

error running on m1 Mac

Running on an Apple silicon Mac I was able to install verstack, but running the first Multicore example produced this error:

RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

This occurred once for each of the 8 cores it was trying to use.
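
The error message itself points at the fix: on macOS (and Windows), Python starts workers with spawn, so Multicore calls must sit behind the main-module guard (a minimal sketch):

from verstack import Multicore

def exponential(n):
    return n ** 2

if __name__ == '__main__':
    # The guard prevents each spawned worker from re-running the
    # module top level and recursively starting more workers.
    worker = Multicore()
    result = worker.execute(exponential, range(1_000_000))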

Ignore index in `scsplit`

Is it possible to have a feature to ignore the index in scsplit? Currently, if the index is not ordered, an error is thrown: IndexError: positional indexers are out-of-bounds. Thanks.
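
Until such a feature exists, resetting the index before the split appears to sidestep the IndexError (a workaround sketch):

from verstack.stratified_continuous_split import scsplit

# Give both inputs a clean positional index before splitting.
x = x.reset_index(drop=True)
y = y.reset_index(drop=True)
x_train, x_val, y_train, y_val = scsplit(x, y, stratify=y, test_size=0.3)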

Possible problem in tools.py, in the computation of hours, minutes, seconds, when elapsed >= 3600

In the computation of hours, minutes, and seconds in tools.py, when elapsed >= 3600, I think seconds will always come out equal to minutes.

I think the computation you want is:

seconds = int(elapsed % 60)
elapsed = int(elapsed // 60)
minutes = int(elapsed % 60)
hours = int(elapsed // 60)

or perhaps seconds = round(elapsed % 60, 3) in the first line instead,
if you want it to give precision similar to the 60 < elapsed < 3600 case.
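
A quick check of the proposed computation with elapsed = 3725, i.e. 1 hour, 2 minutes, 5 seconds:

elapsed = 3725                   # 1 hour, 2 minutes, 5 seconds

seconds = int(elapsed % 60)      # 5
elapsed = int(elapsed // 60)     # 62 whole minutes
minutes = int(elapsed % 60)      # 2
hours = int(elapsed // 60)       # 1

assert (hours, minutes, seconds) == (1, 2, 5)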

`KeyError: "['target'] not found in axis"` using FeatureSelector

When using the FeatureSelector, I get the error shown in the title. The error stack is as follows:

KeyError                                  Traceback (most recent call last)
/var/folders/0z/1wc9zs_j3655t713wzz2227w0000gp/T/ipykernel_3464/916624305.py in <cell line: 1>()
----> 1 fs.fit_transform(trainx, trainy)

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/verstack/tools.py in wrapped(*args, **kwargs)
     17     def wrapped(*args, **kwargs):
     18         start = time.time()
---> 19         result = func(*args, **kwargs)
     20         end = time.time()
     21         elapsed = round(end-start,5)

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/verstack/FeatureSelector.py in fit_transform(self, X, y, **kwargs)
    230         if self.auto:
    231             self.printer.print(f'Comparing LinearRegression and RandomForest for feature selection', order = 2)
--> 232             self._auto_linear_randomforest_selector(X, y, kwargs)
    233         else:
    234             self.printer.print(f'Running feature selection with {self._model}', order = 2)

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/verstack/FeatureSelector.py in _auto_linear_randomforest_selector(self, X, y, kwargs)
    292 
    293         self.printer.print(f'Running feature selection with {linear_model}', order = 2)
--> 294         feats_lr_flags = self._prepare_data_apply_selector(X, y, selector_lr, scale_data = True)
    295 
    296         self.printer.print(f'Running feature selection with {randomforest_model}', order = 2)

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/verstack/FeatureSelector.py in _prepare_data_apply_selector(self, X, y, selector, scale_data)
    247 
    248     def _prepare_data_apply_selector(self, X, y, selector, scale_data = False):
--> 249         X_subset, y_subset = self._subset_data(X, y)
    250         if scale_data:
    251             X_subset = self._scale_data(X_subset)

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/verstack/FeatureSelector.py in _subset_data(self, X, y)
    484             experimental_data = temp.sample(frac=batch)
    485             experimental_data.reset_index(drop=True, inplace=True)
--> 486             X = experimental_data.drop('target', axis=1)
    487             y = experimental_data.target
    488             self.printer.print(f'Data decreased for experiments. Working with {np.round(batch*100,2)}% of data', order = 3)

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312 
    313         return wrapper

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/frame.py in drop(self, labels, axis, index, columns, level, inplace, errors)
   4904                 weight  1.0     0.8
   4905         """
-> 4906         return super().drop(
   4907             labels=labels,
   4908             axis=axis,

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/generic.py in drop(self, labels, axis, index, columns, level, inplace, errors)
   4148         for axis, labels in axes.items():
   4149             if labels is not None:
-> 4150                 obj = obj._drop_axis(labels, axis, level=level, errors=errors)
   4151 
   4152         if inplace:

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/generic.py in _drop_axis(self, labels, axis, level, errors)
   4183                 new_axis = axis.drop(labels, level=level, errors=errors)
   4184             else:
-> 4185                 new_axis = axis.drop(labels, errors=errors)
   4186             result = self.reindex(**{axis_name: new_axis})
   4187 

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/indexes/base.py in drop(self, labels, errors)
   6015         if mask.any():
   6016             if errors != "ignore":
-> 6017                 raise KeyError(f"{labels[mask]} not found in axis")
   6018             indexer = indexer[~mask]
   6019         return self.delete(indexer)

KeyError: "['target'] not found in axis"

I am using v3.3.3, and the source code for FeatureSelector is unchanged compared to the newest version.

P.S. I added the full error stack; the following code was used:

fs = FeatureSelector(auto=True, objective="regression")
fs.fit_transform(trainx, trainy)
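
A heavily hedged guess: the failing line drops a column literally named 'target' after X and y are joined, so a y passed as a one-column DataFrame rather than a Series could plausibly break that step. Something along these lines may be worth trying:

import pandas as pd
from verstack import FeatureSelector

# Hypothesis only: make sure y is a plain pandas Series before the
# call, so the internal X/y join produces the expected 'target' column.
trainy = pd.Series(trainy.squeeze())

fs = FeatureSelector(auto=True, objective="regression")
selected = fs.fit_transform(trainx, trainy)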

CNN on surgical site infection

Hello,

I want to build a deep learning model that can predict whether a surgical wound (delivery by cesarean section) is infected or not; attached are sample images.
1. What pre-trained model would you advise for transfer learning?
2. What would be the best image format to work with: LAB, RGB, or HSV?

(attachment: wound images)
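
Not a verstack question, but for reference, a minimal transfer-learning sketch with a Keras pretrained backbone (the model choice and input size are illustrative, not a recommendation):

import tensorflow as tf

# Freeze an ImageNet-pretrained backbone and train a small binary
# head on top (infected vs. not infected).
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights='imagenet')
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])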

Cross validation support during HPO

I used the hyperparameter optimisation and found it really useful, thanks. I was hoping to carry out the same process but with cross-validation (specifically GroupKFold). Support for scikit-learn CV would be very useful.
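
As a stopgap, the tuned configuration can be evaluated with scikit-learn's GroupKFold outside the tuner (standard scikit-learn API; tuner.best_params and groups are assumed names):

from lightgbm import LGBMRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

# Evaluate the tuned configuration with group-aware folds;
# `groups` is whatever grouping key the data provides.
model = LGBMRegressor(**tuner.best_params)
scores = cross_val_score(model, X, y, groups=groups, cv=GroupKFold(n_splits=5))
print(scores.mean())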

ThreshTuner

Hello Danil,

Can you please clarify how to use ThreshTuner, or maybe direct me to a working example?

From the documentation, apart from setting the min and max threshold, the metric function, etc., I don't see where to plug in the algorithm or the data used:

thresh = ThreshTuner(n_thresholds = 500, min_threshold = 0.2, max_threshold = 0.6)
thresh.fit(labels, pred, f1_score)
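
ThreshTuner operates on predictions you have already produced, so the model and data come before it. A sketch of the full flow (the classifier is illustrative):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from verstack import ThreshTuner

# 1. Train any binary classifier and get positive-class probabilities.
model = LogisticRegression().fit(X_train, y_train)
pred = model.predict_proba(X_val)[:, 1]

# 2. ThreshTuner then only searches thresholds over those
#    probabilities, exactly as in the documented call above.
thresh = ThreshTuner(n_thresholds=500, min_threshold=0.2, max_threshold=0.6)
thresh.fit(y_val, pred, f1_score)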

latest tensorflow not possible?

When I do:

pip install verstack==3.2.1

to force latest version, I get:

ERROR: Could not find a version that satisfies the requirement tensorflow==2.7.0 (from verstack) (from versions: 2.8.0rc1, 2.8.0, 2.8.1, 2.8.2, 2.9.0rc0, 2.9.0rc1, 2.9.0rc2, 2.9.0, 2.9.1)
ERROR: No matching distribution found for tensorflow==2.7.0

Can this package not work with the latest tensorflow?

Stacker

I trained Stacker and everything went well, but when I try to fit the final model with either layer_1_feats or both, I get an error.

stacker = Stacker(objective = 'binary', auto = True)
X_train = stacker.fit_transform(X_train_un, y_train_un)
X_test = stacker.transform(X_test)

# get lists of features created in each layer
layer_1_feats = stacker.stacked_features['layer_1']
layer_2_feats = stacker.stacked_features['layer_2']

model = LogisticRegression(random_state=1)

model.fit(X_train_un[layer_2_feats], y_train_un)

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
[<ipython-input-126-9363d1839018>](https://localhost:8080/#) in <module>
----> 1 model.fit(X_train_un[layer_2_feats], y_train_un)

2 frames
[/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py](https://localhost:8080/#) in _validate_read_indexer(self, key, indexer, axis)
   1372                 if use_interval_msg:
   1373                     key = list(key)
-> 1374                 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   1375 
   1376             not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())

KeyError: "None of [Index(['layer_2_0', 'layer_2_1', 'diff_layer_2_0_layer_2_1', 'layer_2_std',\n       'layer_2_mean'],\n      dtype='object')] are in the [columns]"

MeanTargetEncoder

Can you please give more clarification on how MeanTargetEncoder works?
I have tried the code below:
"lost" is my target variable

from verstack import MeanTargetEncoder
mean_target_encoder = MeanTargetEncoder(save_inverse_transform = True)
X_train= mean_target_encoder.fit_transform(X_train, 'health_center', 'lost')
X_test = mean_target_encoder.transform(X_test)

but I keep getting the error below:

AssertionError                            Traceback (most recent call last)
File ~\anaconda3\lib\site-packages\verstack\categoric_encoders\args_validators.py:43, in assert_fit_transform_args(df, colname, targetname)
     42 try:
---> 43     assert(targetname in df)
     44 except:

AssertionError:

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
Input In [458], in <cell line: 3>()
      1 from verstack import MeanTargetEncoder
      2 mean_target_encoder = MeanTargetEncoder(save_inverse_transform = True)
----> 3 X_train= mean_target_encoder.fit_transform(X_train, 'health_center', 'lost')
      4 X_test = mean_target_encoder.transform(X_test)

File ~\anaconda3\lib\site-packages\verstack\categoric_encoders\MeanTargetEncoder.py:79, in MeanTargetEncoder.fit_transform(self, df, colname, targetname)
     61 def fit_transform(self, df, colname, targetname):
     62     '''
     63     Fit encoder, transform column in df, save attributes for transform(/inverse_transform().
     64 
   (...)
     77         Data with the column transformed.
     78     '''
---> 79     assert_fit_transform_args(df, colname, targetname)
     80     from sklearn.model_selection import KFold
     82     self._colname = colname

File ~\anaconda3\lib\site-packages\verstack\categoric_encoders\args_validators.py:45, in assert_fit_transform_args(df, colname, targetname)
     43     assert(targetname in df)
     44 except:
---> 45     raise KeyError('"targetname" must a valid column name in df')

KeyError: '"targetname" must a valid column name in df'

a question about multicore

Hello, I read this article and tried the code example(s) for iterating over a single object. I tried two different versions, and the time difference is quite significant, which leads to my question: why?

Environment:

  • Windows 10 Pro
  • Python 3.7.4

code example 1 (result: 0.8 seconds)

data = range(0,1000000)

def exponential(n):
    # Real hard work here
    return n**2

def execute_func_using_verstack(func,iterable):
    from verstack import Multicore
    worker = Multicore()
    result = worker.execute(func, iterable)

if __name__ == '__main__':
    execute_func_using_verstack(exponential, data)

code example 2 (result: 2.26 seconds)

from verstack import Multicore
worker = Multicore()

data = range(0,1000000)

def exponential(n):
    # Real hard work here
    return n**2

if __name__ == '__main__':
    result = worker.execute(exponential, data)

The time difference is nearly threefold.
Is this specific to Python 3.7.4 (on Windows)?
Or is this specific to my computer?
I'm just curious.

From what I could observe so far, it seems the Python file is executed n times (n = number of workers),
except for the execute_func_using_verstack() function and whatever's under if __name__ == '__main__'.
You can see that happening, with

data = range(0,1000000)

print('engaging')

def exponential(n):
    # Real hard work here
    return n**2

def execute_func_using_verstack(func,iterable):
    from verstack import Multicore
    print('working')
    worker = Multicore()
    result = worker.execute(func, iterable)

if __name__ == '__main__':
    print('starting')
    execute_func_using_verstack(exponential, data)

resulting in

engaging
starting
working
Multicore(workers = 8,
          multiple_iterables = False

Initializing 8 workers for exponential execution
engaging
engaging
engaging
engaging
engaging
engaging
engaging
engaging

This means that if I do all the imports at the top, they will be executed 8 times, resulting in slower run times.
Is this intended behavior?

Let me know if there is any other information you want me to provide.
Also, great work!

feval function for LGBMTuner

Can you add a feval function like in lightgbm.train?

It is the customized evaluation function. Each evaluation function should accept two parameters, preds and eval_data, and return (eval_name, eval_result, is_higher_better) or a list of such tuples.
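
For reference, the shape such a function takes in plain lightgbm.train (standard LightGBM API; the metric itself is illustrative):

import numpy as np

def mape_feval(preds, eval_data):
    # Custom feval in the form lightgbm.train expects:
    # eval_data is the lgb.Dataset being evaluated.
    y_true = eval_data.get_label()
    mape = np.mean(np.abs((y_true - preds) / np.maximum(np.abs(y_true), 1e-9)))
    return 'mape', mape, False   # (eval_name, eval_result, is_higher_better)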

No such file or directory: '/Users/danil/OneDrive/...'

Hello Danil

I just did a pip install verstack, and when I try to use the NaNImputer, I get this message.
I'm using a Jupyter notebook.

The Factorizer file looks a little messed up to me, or am I just "holding it wrong"?

FileNotFoundError                        Traceback (most recent call last)
<ipython-input-21-36b05ef79f67> in <module>
     71 
     72 
---> 73 from verstack import NaNImputer


FileNotFoundError: [Errno 2] No such file or directory: '/Users/danil/OneDrive/Bell/datasets/_traditional/house_prices (rmsle,Id,SalePrice)/house_prices_train.csv'

KeyError in Stacker

This is the code:

from verstack import Stacker

stacker = Stacker(objective = 'regression', auto = True)
X_train = stacker.fit_transform(X_train, y_train)
X_val = stacker.transform(X_val)
df1 = stacker.transform(df1)

# get lists of features created in each layer

layer_1_feats = stacker.stacked_features['layer_1']
layer_2_feats = stacker.stacked_features['layer_2']

model = LGBMRegressor(random_state=1)

# use only the second layer outputs as inputs into the final meta_model

model.fit(X_train[layer_2_feats], y_train)
pred = model.predict(df1[layer_2_feats])

And below is the error I am getting:

  • Initiating Stacker.fit_transform

    • Training/predicting with layer_1 models
      . Optimising model hyperparameters

KeyError                                  Traceback (most recent call last)
File :13, in

File ~\anaconda3\lib\site-packages\verstack\tools.py:19, in timer.<locals>.wrapped(*args, **kwargs)
16 @wraps(func)
17 def wrapped(*args, **kwargs):
18 start = time.time()
---> 19 result = func(*args, **kwargs)
20 end = time.time()
21 elapsed = round(end-start,5)

File ~\anaconda3\lib\site-packages\verstack\stacking\Stacker.py:615, in Stacker.fit_transform(self, X, y)
613 validate_fit_transform_args(X, y)
614 X_with_stacked_feats = X.reset_index(drop=True).copy()
--> 615 X_with_stacked_feats = self._apply_all_or_extra_layers_to_train(X_with_stacked_feats, y)
616 return X_with_stacked_feats

File ~\anaconda3\lib\site-packages\verstack\stacking\Stacker.py:574, in Stacker._apply_all_or_extra_layers_to_train(self, X, y)
572 if layers_added_after_fit_transform:
573 for layer in layers_added_after_fit_transform:
--> 574 X = self._apply_single_layer(layer, X, y)
575 else:
576 # if no extra layers apply all layers on train set
577 X = self._apply_all_layers(X, y)

File ~\anaconda3\lib\site-packages\verstack\stacking\Stacker.py:510, in Stacker._apply_single_layer(self, layer, X, y)
506 new_feats = self._create_new_feats_in_test(X, y, layer, applicable_feats)
507 # ---------------------------------------------------------------------
508 # create stacked feats in train set
509 else:
--> 510 new_feats = self._create_new_feats_in_train(X, y, layer, applicable_feats)
511 for feat in new_feats:
512 X = pd.concat([X, feat], axis = 1)

File ~\anaconda3\lib\site-packages\verstack\stacking\Stacker.py:478, in Stacker._create_new_feats_in_train(self, X, y, layer, applicable_feats)
476 for model in self.layers[layer]:
477 feat_name = self._create_feat_name(layer)
--> 478 new_feat = self._get_stack_feat(model, X[applicable_feats], y)
479 # append trained models from buffer to self.trained_models_list for layer/feature
480 self.trained_models[layer][feat_name] = self._trained_models_list_buffer

File ~\anaconda3\lib\site-packages\verstack\stacking\Stacker.py:316, in Stacker._get_stack_feat(self, model, X, y)
314 '''Apply stacking features creatin to either train or test set'''
315 if isinstance(y, pd.Series):
--> 316 new_feat = self._train_predict_by_model(model, X, y)
317 else:
318 new_feat = self._predict_by_model(model, X)

File ~\anaconda3\lib\site-packages\verstack\stacking\Stacker.py:300, in Stacker._train_predict_by_model(self, model, X, y)
298 for train_ix, test_ix in kfold.split(X,y):
299 X_train = X.loc[train_ix, :]
--> 300 y_train = y.loc[train_ix]
301 X_test = X.loc[test_ix, :]
302 # create independent model instance for each fold

File ~\anaconda3\lib\site-packages\pandas\core\indexing.py:967, in _LocationIndexer.__getitem__(self, key)
964 axis = self.axis or 0
966 maybe_callable = com.apply_if_callable(key, self.obj)
--> 967 return self._getitem_axis(maybe_callable, axis=axis)

File ~\anaconda3\lib\site-packages\pandas\core\indexing.py:1191, in _LocIndexer._getitem_axis(self, key, axis)
1188 if hasattr(key, "ndim") and key.ndim > 1:
1189 raise ValueError("Cannot index with multidimensional key")
-> 1191 return self._getitem_iterable(key, axis=axis)
1193 # nested tuple slicing
1194 if is_nested_tuple(key, labels):

File ~\anaconda3\lib\site-packages\pandas\core\indexing.py:1132, in _LocIndexer._getitem_iterable(self, key, axis)
1129 self._validate_key(key, axis)
1131 # A collection of keys
-> 1132 keyarr, indexer = self._get_listlike_indexer(key, axis)
1133 return self.obj._reindex_with_indexers(
1134 {axis: [keyarr, indexer]}, copy=True, allow_dups=True
1135 )

File ~\anaconda3\lib\site-packages\pandas\core\indexing.py:1327, in _LocIndexer._get_listlike_indexer(self, key, axis)
1324 ax = self.obj._get_axis(axis)
1325 axis_name = self.obj._get_axis_name(axis)
-> 1327 keyarr, indexer = ax._get_indexer_strict(key, axis_name)
1329 return keyarr, indexer

File ~\anaconda3\lib\site-packages\pandas\core\indexes\base.py:5782, in Index._get_indexer_strict(self, key, axis_name)
5779 else:
5780 keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 5782 self._raise_if_missing(keyarr, indexer, axis_name)
5784 keyarr = self.take(indexer)
5785 if isinstance(key, Index):
5786 # GH 42790 - Preserve name from an Index

File ~\anaconda3\lib\site-packages\pandas\core\indexes\base.py:5845, in Index._raise_if_missing(self, key, indexer, axis_name)
5842 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
5844 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 5845 raise KeyError(f"{not_found} not in index")

KeyError: '[3, 10, 16, 20, 21, 23, 25, 35, 38, 40, 41, 43, 47, 48, 60, 61, 65, 74, 77, 85, 86, 93, 98, 100, 110, 113, 121, 126, 128, 129, 133, 134, 135, 136, 142, 143, 149, 150, 151, 158, 159, 162, 165, 169, 172, 175, 177, 183, 188, 195, 203, 204, 219, 221, 223, 228, 229, 231, 236, 249, 264, 271, 274, 275, 283, 290, 296, 303, 304, 308, 311, 319, 323, 330, 331, 332, 338, 343, 354, 358, 359, 363, 366, 372, 379, 383, 385, 386, 388, 394, 399, 416, 419, 421, 424, 426, 436, 441, 442, 445, 449, 462, 468, 471, 474, 480, 489, 501, 507, 510, 530, 542, 547, 551, 555, 557, 559, 567, 572, 576, 582, 584, 585, 588, 609, 610, 611, 613, 615, 618, 619, 620, 622, 625, 627, 633, 641, 642, 660, 663, 668, 669, 670, 671, 674, 675, 690, 692, 693, 695, 703, 706, 710, 714, 716, 726, 730, 733, 739, 742, 749, 755, 756, 757, 761, 762, 763, 765, 768, 771, 772, 784, 793, 804, 806, 815, 821, 824, 831, 834, 835, 836, 848, 858, 861, 871, 874, 875, 878, 880, 884, 885, 892, 895, 896, 897, 904, 909, 912, 917, 921, 923, 925, 926, 928, 931, 932, 937, 938, 948, 950, 952, 962, 964, 974, 984, 987, 991, 996, 998, 1003, 1007, 1011, 1012, 1015, 1026, 1031, 1032, 1033, 1035, 1038, 1040, 1056, 1068, 1070, 1077, 1078, 1082, 1083, 1084, 1085, 1105, 1110, 1112, 1115, 1121, 1126, 1130, 1136, 1137, 1141, 1151, 1154, 1156, 1172, 1175, 1179, 1184, 1185, 1187, 1189, 1192, 1211, 1217, 1218, 1225, 1229, 1231, 1232, 1249, 1256, 1268, 1278, 1280, 1281, 1286, 1287, 1293, 1295, 1298, 1309, 1310, 1311, 1313, 1318, 1323, 1326, 1331, 1332, 1336, 1339, 1344, 1345, 1347, 1348, 1357, 1364, 1365, 1366, 1369, 1372, 1379, 1380, 1383, 1385, 1388, 1392, 1396, 1402, 1404, 1415, 1427, 1429, 1437, 1440, 1441, 1452, 1453, 1457, 1458, 1459, 1463, 1464, 1472, 1473, 1474, 1479, 1482, 1486, 1499, 1507, 1517, 1518, 1521, 1535, 1537, 1539, 1541, 1542, 1544, 1545, 1548, 1551, 1554, 1555, 1556, 1557, 1570, 1577, 1578, 1579, 1587, 1590, 1610, 1613, 1617, 1618, 1624, 1633, 1634, 1635, 1636, 1639, 1643, 1651, 1657, 1673, 1680, 1687, 1692, 1697, 1723, 1732, 1735, 1736, 1737, 1747, 1750, 1753, 1756, 1757, 1762, 1765, 1772, 1774, 1783, 1788, 1794, 1800, 1815, 1824, 1836, 1843, 1844, 1847, 1849, 1850, 1863, 1870, 1877, 1886, 1888, 1891, 1894, 1897, 1899, 1904, 1910, 1914, 1918, 1923, 1925, 1926, 1935, 1938, 1940, 1949, 1951, 1956, 1959, 1961, 1966, 1972, 1973, 1979, 1987, 1994, 1997, 2002, 2014, 2017, 2018, 2023, 2028, 2032, 2039, 2044, 2045, 2048, 2050, 2072, 2077, 2079, 2081, 2089, 2092, 2101, 2105, 2107, 2108, 2113, 2115, 2116, 2118, 2119, 2125, 2135, 2141, 2142, 2143, 2144, 2147, 2149, 2151, 2160, 2163, 2166, 2167, 2174, 2175, 2180, 2181, 2184, 2195, 2198, 2200, 2207, 2215, 2218, 2221, 2222, 2224, 2236, 2239, 2243, 2245, 2248, 2254, 2256, 2260, 2264, 2268, 2270, 2274, 2275, 2283, 2288, 2289, 2291, 2292, 2297, 2305, 2309, 2317, 2323, 2324, 2327, 2329, 2330, 2333, 2335, 2336, 2338, 2349, 2360, 2361, 2363, 2370, 2373, 2379, 2384, 2388, 2393, 2399, 2403, 2405, 2408, 2411, 2413, 2416, 2419, 2421, 2429, 2435, 2437, 2439, 2443, 2444, 2446, 2449, 2461, 2470, 2473, 2478, 2484, 2485, 2492, 2503, 2504, 2505, 2506, 2507, 2515, 2518, 2531, 2538, 2544, 2545, 2548, 2557, 2560, 2564, 2568, 2569, 2573, 2578, 2583, 2587, 2596, 2623, 2626, 2630, 2633, 2647, 2652, 2659, 2662, 2664, 2670, 2678, 2679, 2682, 2683, 2686, 2689, 2690, 2701, 2702, 2704, 2705, 2708, 2723, 2726, 2732, 2734, 2746, 2750, 2752, 2758, 2761, 2767, 2769, 2770, 2777, 2779, 2781, 2785, 2788, 2790, 2793, 2794, 2800, 2803, 2806, 2807, 2810, 2814, 2830, 2832, 2840, 2850, 2859, 2861, 2862, 2867, 2879, 2882, 2887, 2899, 2901, 2904, 
2906, 2911, 2920, 2922, 2924, 2927, 2928, 2929, 2934, 2936, 2941, 2944, 2972, 2977, 2979, 2984, 2985, 2986, 2990, 2991, 2995, 3005, 3008, 3013, 3022, 3028, 3031, 3037, 3038, 3039, 3044, 3046, 3048, 3051, 3056, 3059, 3062, 3063, 3064, 3070, 3074, 3076, 3078, 3079, 3083, 3091, 3094, 3096, 3098, 3112, 3120, 3124, 3126, 3133, 3134, 3135, 3145, 3146, 3159, 3160, 3165, 3172, 3183, 3184, 3186, 3188, 3189, 3190, 3203, 3205, 3211, 3214, 3229, 3236, 3241, 3253, 3255, 3260, 3273, 3276, 3280, 3283, 3284, 3289, 3292, 3293, 3294, 3301, 3302, 3306, 3312, 3313, 3317, 3326, 3327, 3328, 3339, 3340, 3346, 3350, 3353, 3355, 3356, 3358, 3372, 3373, 3374, 3375, 3380, 3381, 3382, 3384, 3395, 3397, 3405, 3406, 3407, 3409, 3416, 3418, 3420, 3422, 3423, 3424, 3425, 3427, 3432, 3433, 3438, 3443, 3453, 3455, 3456, 3467, 3469, 3470, 3471, 3472, 3474, 3475, 3476, 3494, 3508, 3516, 3517, 3518, 3520, 3522, 3527, 3532, 3534, 3540, 3544, 3546, 3549, 3550, 3551, 3553, 3565, 3570, 3571, 3577, 3579, 3580, 3583, 3584, 3599, 3602, 3611, 3614, 3615, 3625, 3628, 3637, 3642, 3643, 3646, 3650, 3651, 3664, 3677, 3683, 3698, 3705, 3710, 3711, 3714, 3722, 3728, 3729, 3734 ] not in index'
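
Grounded in the traceback: Stacker resets X's index (Stacker.py:614) but selects y with y.loc[train_ix] using positional fold indices, so a y_train with a non-default index raises exactly this KeyError. Resetting both indexes first appears to be a workaround:

from verstack import Stacker

# Workaround sketch: align both inputs on a clean 0..n-1 index,
# since Stacker resets X's index internally but indexes y with .loc.
X_train = X_train.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)

stacker = Stacker(objective='regression', auto=True)
X_train = stacker.fit_transform(X_train, y_train)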
