ealcobaca / pymfe

Python Meta-Feature Extractor package.

Home Page: https://pymfe.readthedocs.io

License: MIT License

Python 99.71% Makefile 0.29%

automl machine-learning meta-feature meta-features meta-learning metalearning

pymfe's Introduction

Hi there! I am Edesio

I'm a Data Scientist from Brazil. 🇧🇷

I have been researching and developing tools for machine learning and AutoML. 🤓

I am currently a Ph.D. candidate at the University of São Paulo. 👨‍🎓

My advisor and I are developing a project to automate end-to-end machine learning pipelines. This journey has been very challenging and fruitful! My research interests are machine learning, AutoML, meta-learning, optimization for machine learning, computational mathematics, and bioinformatics. 👨‍🔬

I like to build things with machine learning, distributed systems, Python, SQL, and modern backend/frontend frameworks. 👨‍💻

How to reach me: 📫

My website: ealcobaca.github.io
GitHub as @ealcobaca (you are here)
LinkedIn

May the force be with you!

pymfe's Issues

[BUG] nan value when complexity measure is calculated

Describe the bug
Hello. This is my first time writing in the 'Issues' tab (I'm a new GitHub user).
If I don't follow the proper GitHub format or my English is unclear, please bear with me.

The problem is that when I calculate complexity measures, I randomly get 'nan' values. I have spent a few weeks on this and read the code along with some papers, but I can't work out why.
I replace the nan values with 0, but I'm worried that there is no sound logic behind this.
What is the difference between nan and 0.0? Do you think this replacement (nan to 0) is okay?

Could you give me any advice?
Thank you.
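
As a small illustration of the nan-versus-0 question above: nan marks a value that could not be computed, while 0.0 is treated as an actual measurement. The sketch below uses made-up numbers (an assumption, not real pymfe output) to show how the two choices change a downstream average.

```python
import numpy as np

# Hypothetical values of one meta-feature across three datasets;
# the second extraction failed and produced nan.
values = np.array([0.2, np.nan, 0.4])

# Replacing nan with 0 treats "could not be computed" as a measured 0,
# which pulls the average toward 0.
zero_filled = np.nan_to_num(values, nan=0.0)
mean_zero_filled = zero_filled.mean()

# Ignoring nan averages only the values that were actually computed.
mean_ignoring_nan = np.nanmean(values)

# A separate indicator keeps the information that a value was missing.
was_missing = np.isnan(values)
```

Whether 0 is a reasonable stand-in depends on the measure; for a measure where 0 has a specific meaning, keeping nan (or adding an indicator like `was_missing`) avoids mixing the two cases.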


Additional context
Thank you for creating pymfe. It has been a big help for my first research project.

Performance of Info-theoretic extraction

Hello.
I've been using PyMFE to extract meta-features (MtF) and ran into an execution time that was unexpected (to me at least).
Specifically, the time spent extracting all the information-theoretic meta-features was quite large, far exceeding the total of all the other groups.
The extraction process was simple: extract MtF from each group separately using a sliding window of width w = 300 with a step of 10. This process was repeated 2500 times for each MtF group. The total execution time is shown in the table below:

Duration	Group
34.5s	concept
24.1s	itemset
6.1min	complexity
34.1s	clustering
3.5min	relative
42.4min	info-theory
15.9s	model-based
43.3s	statistical
3.5min	landmarking

The dataset used was elec2.
I would like to know the possible causes of such a huge difference in extraction time, or whether this is the expected behavior.
NOTE: each time involves other computations such as model training and evaluation. However, the code is the same for all experiments, the only difference being the MtF group employed for characterizing the windows.
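
To separate extraction time from the other computations mentioned in the note, the windows can be timed individually. The sketch below is only an illustration: `sliding_windows` and `time_per_window` are hypothetical helpers, and the placeholder `extract` callable (column means) stands in for the real per-group extraction call.

```python
import time
import numpy as np

def sliding_windows(n_rows, width=300, step=10):
    """Yield (start, stop) row indices of every full window."""
    for start in range(0, n_rows - width + 1, step):
        yield start, start + width

def time_per_window(extract, X, width=300, step=10):
    """Return the wall-clock seconds spent inside `extract` for each window."""
    durations = []
    for start, stop in sliding_windows(X.shape[0], width, step):
        t0 = time.perf_counter()
        extract(X[start:stop])
        durations.append(time.perf_counter() - t0)
    return durations

# Placeholder extractor; swap in the real per-group extraction here.
X = np.random.default_rng(0).normal(size=(320, 5))
durations = time_per_window(lambda window: window.mean(axis=0), X)
```

Running this once per meta-feature group gives per-window timings that exclude model training and evaluation.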

Check strange behavior

Check the behavior of the pymfe package under the following datasets:

1056_mc1.arff
1475_first-order-theorem-proving.arff
1487_ozone-level-8hr.arff
40705_tokyo1.arff
761_cpu_act.arff

[BUG] Running out of memory

Describe the bug
I am trying to fit a [414 x 514] dataset with no y labels, but even on google colab with high ram mode, the RAM is getting full and it is crashing.
Is it possible to fit it in batches, like the DataGenerator and fit_generator() function we use when training neural networks, or is there any other possible solution?

Thanks in advance.

[BUG] Online documentation example seems incorrect

Describe the bug
The following online documentation content seems to be in the incorrect URL:
https://pymfe.readthedocs.io/en/latest/auto_examples/03_miscellaneous_examples/plot_default_value_for_attr_conc.html

To Reproduce
It is an online documentation problem.

Expected behavior
The content at the given URL appears to be in the wrong place, and the example is also lacking.

Screenshots
None.

Desktop (please complete the following information):
None.

Additional context
None.

New pymfe version - 0.4

Organizing the new version of pymfe.

Method for extracting meta-feature names before extraction

Is your feature request related to a problem? Please describe.
Sometimes it is convenient to extract the list of meta-feature names before the meta-feature extraction.

# An example to motivate the requested feature
from pymfe.mfe import MFE
import pandas as pd

number_of_datasets = 64
extractor = MFE(features="all", summary=["mean", "histogram", "max"])

# Init meta-dataset (but we need the meta-feature names...)
# get_dataset_from_id is a user-supplied loader, not part of pymfe.
X, y = get_dataset_from_id(0)
mtf_names, mtf_values_dt_0 = extractor.fit(X, y).extract()  # Ugly, inconvenient...
meta_dataset = pd.DataFrame(index=range(number_of_datasets), columns=mtf_names)

meta_dataset.loc[0, :] = mtf_values_dt_0  # Not visually satisfying...
for i in range(1, number_of_datasets):
    X, y = get_dataset_from_id(i)  # Load the i-th base dataset
    meta_dataset.loc[i, :] = extractor.fit(X, y).extract()[1]  # Repeated line of code...

For instance, if the goal is to construct a meta-dataset using a pandas dataframe, it is convenient to get the meta-feature names in order to correctly set the data frame column names even before the meta-feature extraction starts. If the meta-dataset is simply a numpy array, then the number of extracted meta-features is sufficient to correctly initiate the number of columns in the array, which also is a statistic that can be naturally extracted from the list of meta-feature names (e.g. len(meta_feat_names)). Since in this case we're iterating over various base datasets in order to fill that meta-dataset structure, it is highly inconvenient to make a manual extraction before the iterating process starts (see example above) in order to just prepare the meta-dataset structure.

See section below for concrete examples.

Describe the solution you'd like
I propose a method '.extract_metafeature_names()' that works even before any data is fit, using only the information obtained while instantiating an MFE model.

Concrete usage examples would be:

# A pandas data-frame example
from pymfe.mfe import MFE
import pandas as pd

number_of_datasets = 64
extractor = MFE(features="all", summary=["mean", "histogram", "max"])
mtf_names = extractor.extract_metafeature_names()  # To be implemented...

# Init meta-dataset with the correct number of columns and column names
meta_dataset = pd.DataFrame(index=range(number_of_datasets), columns=mtf_names)

for i in range(number_of_datasets):
    X, y = get_dataset_from_id(i)
    meta_dataset.loc[i, :] = extractor.fit(X, y).extract()[1]

# A numpy array example
from pymfe.mfe import MFE
import numpy as np

number_of_datasets = 64
extractor = MFE(features="all", summary=["mean", "histogram", "max"])
mtf_names = extractor.extract_metafeature_names()  # To be implemented...

# Init meta-dataset with the correct number of columns
meta_dataset = np.full((number_of_datasets, len(mtf_names)), fill_value=np.nan)

for i in range(number_of_datasets):
    X, y = get_dataset_from_id(i)
    meta_dataset[i, :] = extractor.fit(X, y).extract()[1]

Describe alternatives you've considered
None.

Additional context
None.

General + Statistical Metafeatures

Describe the bug
General
1st. When running attr_to_inst I get the message "It is not possible to make equal discretization, Invalid value encountered in greater".
Statistical
1st. Warning: Can't extract feature 'nr_disc'.
Exception message: ValueError("Input contains NaN, infinity or a value too large for dtype('float64').").
Will set it as 'np.nan' for all summary functions.
Does that mean that my dataset has missing values?

Expected behavior
Expecting a large number of numerical and a small number of binary or categorical.

OS: Win10

Disclaimer
I am very new to data science and I am only in my third year of university, so please excuse me if the explanation is not sufficient.

New summary functions

Is your feature request related to a problem? Please describe.
No.

Describe the solution you'd like
Given an array of real values X = [x_{1} ... x_{n}]^T, I'm proposing six new summary functions:

Power sum: sum of all values in a array to a fixed power 'p': power_sum(X, p: t.Union[int, float]) = sum(X ** p)
P-norm (Minkowski norm): (power_sum(abs(X), p))^(1/p)
Sum: sum of all elements in X. Note that this is just a special case of 'power_sum(X, 1)' but, if implemented separately, it should run faster.

The other three summary functions are the variants of the three summary functions proposed above, but ignoring 'nan' values.

Describe alternatives you've considered
Implement the two summary functions just as regular summary functions.

The power sum may receive both a single power 'p' or a sequence P of powers, power_sum(x, p: t.Union[float, int, t.Sequence[t.Union[float, int]]]). In the first case, the summary function will return a single scalar value. In the second scenario, the summary function will return an array of scalars (similar to summary functions such as the 'histogram'), such that each value is corresponding to a power 'p_{i}' in the sequence P.

The p-norm summary function is defined by the return value of the power sum summary function to the power (1/p), if p is a scalar. Otherwise, it is defined as the return value of the power sum summary function with each value to the power p_{i}^{-1}, i=1, ..., n, such that 'i' is the index of each value.

The summation summary function 'sum' is appropriate for summarizing features based on counting (e.g., the number of elements in X larger than the median value). Such feature methods could just apply the sum themselves before returning the result, returning an integer instead of the full array. However, such an approach prevents the use of other summary functions on the same return value, and those additional summary functions could potentially be as descriptive as the sum itself.
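
The definitions above can be sketched with NumPy as follows. This is an illustrative sketch of the proposal, not the implementation that ended up in pymfe; the function names `power_sum`, `p_norm` and `nan_power_sum` are chosen here to mirror the text.

```python
import numpy as np

def power_sum(x, p):
    """Sum of x**p; returns a scalar for scalar p, an array for a sequence of powers."""
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    if p.ndim == 0:
        return float(np.sum(x ** p))
    # One column per power p_i, summed over the elements of x.
    return np.sum(x[:, np.newaxis] ** p[np.newaxis, :], axis=0)

def p_norm(x, p):
    """Minkowski norm: power_sum(|x|, p) ** (1 / p), element-wise for a sequence of p."""
    p = np.asarray(p, dtype=float)
    return power_sum(np.abs(x), p) ** (1.0 / p)

def nan_power_sum(x, p):
    """Variant of power_sum that ignores nan values in x."""
    x = np.asarray(x, dtype=float)
    return power_sum(x[~np.isnan(x)], p)
```

For example, `power_sum([1, 2, 3], 2)` is the sum of squares, and `p_norm([3, -4], 2)` is the Euclidean norm. Note that negative values combined with a fractional `p` yield nan in `power_sum`, which is one reason the p-norm takes absolute values first.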

Additional context
In works like [1] and [2] I have seen summary-analogous operations of sum of squares to combine multiple values of the same feature to generate the final meta-feature. Here, I'm proposing the generalized operation of 'power_sum(X, p)', which has the special case of 'sum_of_squares(X) = power_sum(X, p=2)'. The 'p-norm' operation is a single-step operation after the 'power_sum' operation, and is also proposed for the sake of completeness.

[1]: T.S. Talagala, R.J. Hyndman and G. Athanasopoulos. Meta-learning how to forecast time series (2018).
[2]: 'tsfeatures' package (https://github.com/robjhyndman/tsfeatures)

Support for regression tasks or issue an error while fitting real-valued 'y'

Is your feature request related to a problem? Please describe.
Support for regression tasks (real valued 'y'), or at least issue an error while fitting real-valued 'y'.

Describe the solution you'd like
Inspired by #92, I would like to request a solution for the scenario where the user fits a real-valued 'y'. The main solution is, of course, to correctly support this type of data using some solution proposed in the scientific literature.

Simple solutions, like the 'y' discretization, can be possible.
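
One simple form of 'y' discretization is equal-frequency binning, sketched below. The helper `discretize_target` and its `n_bins` parameter are hypothetical and only illustrate the idea; this is not how pymfe handles real-valued targets.

```python
import numpy as np

def discretize_target(y, n_bins=5):
    """Map a real-valued target to integer bin labels using quantile edges."""
    y = np.asarray(y, dtype=float)
    # Interior quantile edges split y into n_bins groups of similar size.
    edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(y, edges)

y_real = np.arange(10.0)
y_binned = discretize_target(y_real, n_bins=2)
```

The resulting integer labels can then be passed where a categorical 'y' is expected, at the cost of losing the ordering and spacing information within each bin.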

Describe alternatives you've considered
An alternative temporary solution would be to raise an error when the user fits a real-valued 'y' (much like the sklearn classifiers do). That could potentially break existing code using pymfe, though.

Additional context
None.

Hello, for the output meta-features' names, could you please provide the detailed meaning of each feature?

Hello, for the output meta-features' names, could you please provide the detailed meaning of each feature?
For example, for the meta-feature 'attr_conc.mean', I cannot figure out what 'conc' stands for from any of the docs.

Thank you very much.

ValueError: bins must be monotonically increasing or decreasing

Hi, I would like to know what is wrong with the data when "ValueError: bins must be monotonically increasing or decreasing" is raised. Are there any constraints on the input data types?

Online example for the method to extract meta-feature names before any extraction

Is your feature request related to a problem? Please describe.
No. It is a documentation enhancement.

Describe the solution you'd like
Create an example for the feature developed in issue #105 to improve the online documentation.

Describe alternatives you've considered
I am considering adding a simple example in the folder examples/03_miscellaneous_examples that takes into account the before/after fit scenarios.

Additional context
It is the continuation of issue #105.

Exception on data with missing values

Describe the bug
pymfe seems to be unable to handle missing data. I have a dataframe with 2480 rows containing a missing value (for details see here). In MFE._set_data_numeric the categorical_dummies seem to drop the instances with missing values while the data_num array still has the correct number of rows.

categorical_dummies.shape <class 'tuple'>: (5644, 95)
data_num.shape <class 'tuple'>: (8124, 0)

Stacktrace:

  File "/mnt/c/local/phd/code/meta-learning-base/metafeatures.py", line 211, in calculate
    mfe.fit(X.to_numpy(), y.to_numpy())
  File "/usr/local/lib/python3.6/dist-packages/pymfe/mfe.py", line 915, in fit
    rescale_args=rescale_args)
  File "/usr/local/lib/python3.6/dist-packages/pymfe/mfe.py", line 783, in _set_data_numeric
    axis=1).astype(float)
ValueError: all the input array dimensions except for the concatenation axis must match exactly

Code to reproduce:

import openml
from pymfe.mfe import MFE


ds = openml.datasets.get_dataset(24)
X, y, categorical_indicator, attribute_names = ds.get_data(
    dataset_format='dataframe',
    target=ds.default_target_attribute
)
    
mfe = MFE(features=(['nr_inst', 'nr_attr', 'nr_class', 'nr_outliers', 'skewness', 'kurtosis', 'cor', 'cov',
                     'sparsity', 'gravity', 'var', 'class_ent', 'attr_ent', 'mut_inf',
                     'eq_num_attr', 'ns_ratio', 'nodes', 'leaves', 'leaves_branch', 'nodes_per_attr',
                     'var_importance', 'one_nn', 'best_node', 'linear_discr',
                     'naive_bayes', 'leaves_per_class']))

mfe.fit(X.to_numpy(), y.to_numpy())

Expected behavior
I am not sure about the expected behaviour. I assume at least some meta-features cannot be computed with missing values. At the very least, I would expect the program not to crash and to ignore the meta-features that cannot be calculated.

Desktop (please complete the following information):
Linux-4.4.0-18362-Microsoft-x86_64-with-Ubuntu-18.04-bionic
python 3.6.8 (default, Jan 14 2019, 11:02:34)
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]
numPy 1.16.2
sciPy 1.2.1
scikit-Learn 0.21.3
pandas 0.24.2
patsy 0.24.2
pymfe 0.2.0

Target variable

Forgive me if I am wrong, but, does pymfe differentiate between these target variables:

y = [0,0,0,1,1,1,2,2,2]
y = [class_1, class_1, class_2, class_3]

If not, this requires either automatic data preprocessing or a warning to the user.
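
If such preprocessing turns out to be needed, one way to map string class labels to integer codes is shown below. This is a generic NumPy sketch and makes no claim about what pymfe does internally.

```python
import numpy as np

y_str = np.array(["class_1", "class_1", "class_2", "class_3"])

# `classes` holds the sorted distinct labels; `y_int` holds, for each
# instance, the index of its label within `classes`.
classes, y_int = np.unique(y_str, return_inverse=True)
```

Keeping `classes` around makes it possible to translate the integer codes back to the original labels with `classes[y_int]`.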

[BUG] Many Feature-extraction improvement

Good morning.

In a research project, we are developing an approach to predict the error of future models in a streaming data scenario, in which data is always arriving and we need to decide when to retrain a model so that it does not become obsolete.

For that purpose, we are extracting as many features as we can from the pymfe package (thanks so much for your work, and please continue doing such beautiful work), and we came up with the idea of also using as many datasets as we can for the meta-feature extraction, in order to perhaps predict the error for any kind of dataset, instead of only one dataset covering a single problem/domain.

So, in the present work we are investigating this possibility with regression models for now, but we are asking for help because we would like to extract as many meta-features as we can (we are aware of the dimensionality problem and apply strategies for it later) and ignore the ones that cannot be extracted, as in Fig.a, in some automated way instead of selecting specific meta-features or groups, if that is possible.

For example, to avoid the Fig.b error when extracting from the attached dataset with mfe = MFE(groups="all", summary="all").

Fig.a

Fig.b

Dataset
2019.zip

Thanks again for your excellent work!

[BUG] eigenvalue.mean outputting complex numbers

Describe the bug
When extracting the eigenvalue feature the mean summarization outputs a complex number, like (0.433456+0j), this later causes errors with lightGBM.

To Reproduce
Steps to reproduce the behavior:

Get the "elec2" dataset.
Take the first 300 examples and extract the eigenvalue feature with mean and sd as summarization.
Look at the extracted feature.

Expected behavior
As described by the theory it should output only real values, not complex ones.
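
A general-purpose eigenvalue routine can return a complex dtype even when every imaginary part is zero, which is one possible cause of the output above (an assumption; not confirmed against the pymfe source). The sketch below shows two generic NumPy ways to obtain real values: a solver for symmetric matrices, and a cleanup of values that already carry a zero imaginary part.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))          # stand-in for the first 300 examples
cov = np.cov(X, rowvar=False)          # covariance matrices are symmetric

# eigvalsh is meant for symmetric/Hermitian input and returns real eigenvalues.
eig_real = np.linalg.eigvalsh(cov)

# For a value that is already complex with a zero imaginary part,
# real_if_close drops the imaginary component.
value = complex(0.433456, 0.0)
cleaned = float(np.real_if_close(value))
```

As a user-side workaround, applying `np.real_if_close` to the extracted values before passing them to a downstream model avoids the complex dtype.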

Screenshots

Desktop (please complete the following information):

OS: Ubuntu 18.04.4
Version 0.4

Get all Metafeatures for unsupervised tasks

Hi, I have read the paper "Rivolli, A., Garcia, L. P. F., Soares, C., Vanschoren, J., and de Carvalho, A. C. P. L. F. (2018). Towards Reproducible Empirical Research in Meta-Learning. arXiv:1808.10406." and found this python package. I am now trying to use the meta feature extractor (MFE) for clustering tasks. This means I do not have classification labels. At first sight, this is not a problem since I can just use "dummy" labels and put them in the meta-feature extractor. Then I just have to get all metafeatures that belong to "any" task. But here comes the problem.

First I used the "features" argument in the MetafeatureExtractor to extract all metafeatures which can be used for task "any" by their names. For this, I used the above-mentioned paper since there are the acronyms as well as the task (Classification, any,...) is mentioned. But when doing this I got a lot of warnings that there are unknown metafeatures like "attrToInst" or "nrNUM".

I also tried to extract all metafeatures, then take the result dictionary, copy and paste the name from the result to be sure I have not done any typo. But still I get warning that there are unknown metafeatures, which is really strange (and maybe a bug?) since they were extracted by the MFE.

Warning: Unknown feature "nrcorattr"
Warning: Unknown feature "nrinst"
Warning: Unknown feature "nrnorm"
Warning: Unknown feature "iqrange"
Warning: Unknown feature "nrattr"
Warning: Unknown feature "cattonum"
Warning: Unknown feature "numtocat"
Warning: Unknown feature "insttoattr"
Warning: Unknown feature "gmean"
Warning: Unknown feature "tmean"

So, is there an easy way to get ALL metafeatures that are mentioned in the paper which are only for any tasks?
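
The names quoted from the paper ("attrToInst", "nrNUM") are camelCase, while names used elsewhere on this page ('attr_to_inst', 'nr_num') are snake_case, so a name-conversion step may help. The helper below is a hypothetical sketch; it assumes the mapping is purely a change of naming style, which may not hold for every meta-feature.

```python
import re

def camel_to_snake(name):
    """Insert '_' before each lower-to-upper transition, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

paper_names = ["attrToInst", "nrNUM", "instToAttr"]
pymfe_style_names = [camel_to_snake(name) for name in paper_names]
```

Note that names which are already fully lowercase (such as "nrcorattr" in the warnings above) carry no case information, so they cannot be converted this way.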

New canonical correlation-based metafeatures

Is your feature request related to a problem? Please describe.
No.

Describe the solution you'd like
The following new statistical-group meta-features could be easily implemented using the canonical correlation values:

Pillai’s trace – Pillai’s trace is the sum of the squared canonical correlations.
Lawley-Hotelling trace – It is the sum of the values of (canonical correlation^2/(1-canonical correlation^2)).
Roy’s largest root – This is the square of the largest canonical correlation.
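
Given an array of canonical correlations, the three statistics as defined above can be sketched as follows. The function names are chosen for illustration, and how the canonical correlations themselves are obtained is left out.

```python
import numpy as np

def pillai_trace(can_cors):
    """Sum of the squared canonical correlations."""
    r2 = np.square(np.asarray(can_cors, dtype=float))
    return float(np.sum(r2))

def lawley_hotelling_trace(can_cors):
    """Sum of r^2 / (1 - r^2) over the canonical correlations."""
    r2 = np.square(np.asarray(can_cors, dtype=float))
    return float(np.sum(r2 / (1.0 - r2)))

def roy_largest_root(can_cors):
    """Square of the largest canonical correlation."""
    return float(np.max(np.asarray(can_cors, dtype=float)) ** 2)
```

The Lawley-Hotelling trace is undefined when a canonical correlation equals exactly 1, so a real implementation would need to decide how to handle that case.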

Describe alternatives you've considered
I have not implemented these meta-features yet, in order to keep consistency with the R MFE version.

Additional context

Quick additional information about the requested meta-features: https://stats.idre.ucla.edu/stata/output/canonical-correlation-analysis/
The actual implementation needs complete and reliable references.

Complexity measure meta-features

Some papers show that complexity measures could be good meta-features. However, some of these measures are computationally expensive, which can make a meta-learning system ineffective.

We would like to add these meta-features. Nonetheless, a careful study is needed to decide which are the most interesting ones to implement for meta-learning.

@lpfgarcia, could you help us?

Add the link for documentation more clearly in the README.md

Is your feature request related to a problem? Please describe.
We need to make the link to the documentation more visible in the README.md.

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Support for unsupervised tasks

Is your feature request related to a problem? Please describe.
Some metalearning tasks use unsupervised metafeatures (no given y), e.g. metastream, where in data streams the target value is not known at some points in time.

Describe the solution you'd like
In the MFE().fit(x, y) method, the default value for y could be None, so that during extraction it checks y to decide whether or not to perform a supervised extraction, as scikit-learn does.

Describe alternatives you've considered
The above one is the most concise; an alternative is to create another method for extracting unsupervised features.

Additional context
An example in sklearn

Progress feedback

I'm currently trying out pymfe on a dataset that contains about 43 time series, each consisting of about 1000 values.

The execution of MFE.extract() is taking an incredible amount of time.
It would be nice to have some kind of progress indication given back to the user (maybe look at how tsfresh does it)

Organize directory structure

Is your feature request related to a problem? Please describe.
No

Describe the solution you'd like
Separation of concepts reflected on files.

Describe alternatives you've considered
Keep unchanged.

Additional context
None

Output structure

Is your feature request related to a problem? Please describe.
After extracting meta-features the output is two tuples. Maybe output as a dictionary? Would be easier to convert to dataframes later on.

Describe the solution you'd like
Doing something like this dict((x, y) for x, y in zip(ft[0], ft[1])) makes the output look cleaner and easier to read.
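
Spelled out, the conversion looks like the sketch below. The `ft` tuple here is a hand-written stand-in for the (names, values) pair returned after extraction, using values that appear elsewhere on this page.

```python
# Stand-in for the (names, values) output of an extraction.
ft = (["freq_class.mean", "freq_class.sd"], [0.33333334, 0.0])

# Pair each meta-feature name with its value.
ft_dict = dict(zip(ft[0], ft[1]))

# One dict per dataset gives a list of rows that converts directly
# to a dataframe (e.g. with pandas.DataFrame(rows)).
rows = [ft_dict]
```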

Enable parallel computation

Is your feature request related to a problem? Please describe.
No.

Describe the solution you'd like
Enable option for parallel meta-feature extraction.

Describe alternatives you've considered
None.

Additional context
None.

Add meta-features from OpenML and metalearn

This project looks simple and promising. An idea to extend and add more meta-features that have become popular on OpenML and metalearn.

MemoryError when use a large dataset [BUG]

Describe the bug
Hi! I wanted to use pymfe for the characterization of my data, but there is a memory error when using a data set of 20k samples or more.

To Reproduce
I used the following code:

import numpy as np
import pandas as pd
from pymfe.mfe import MFE

data = pd.read_csv('train.csv')
data.dropna(inplace=True)  # Remove rows with missing values
y_target = data['target']
data = data.drop(['target'], axis=1)
y_target = np.array(y_target)
data = np.array(data)
mfe = MFE(features=["nr_class", "nr_num", "nr_cat", "nr_bin", "kurtosis", "nr_outliers", "iq_range", "skewness", "sparsity", "class_ent", "attr_ent", "mut_inf", "ns_ratio", "eq_num_attr"])
mfe.fit(data, y_target)

Running the fit method, the memory error is returned.

Expected behavior
It was supposed to have returned the results of the features that were selected in the pymfe.

Screenshots
Screenshot of the error:

Desktop (please complete the following information):
I tried to use Windows 10 (8GB RAM) and Linux Mint (16GB RAM). I also tried to develop on Google Colab (13GB RAM).

Additional context
I tested the same code with three datasets, but the problem only appeared in the last one, which had a much larger number of samples. Using the read_csv parameter "nrows", I tested for the limit at which the error appears: starting from "nrows = 20000" it began to raise a memory error.
I thought that the error could be due to the number of features I selected, but the error occurs even when selecting a single feature.

Considering datasets from Kaggle or those used in companies, 20k samples is relatively small for this type of error to occur. Any suggestions?
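
Until the underlying cause is addressed, one possible user-side workaround is to fit on a random subsample of rows. The helper below is a hypothetical sketch (`subsample_rows` and `max_rows` are not pymfe names), and meta-features computed on a subsample are only approximations of the full-data values.

```python
import numpy as np

def subsample_rows(X, y, max_rows=5000, seed=0):
    """Return at most `max_rows` randomly chosen rows of X with matching y."""
    X = np.asarray(X)
    y = np.asarray(y)
    if X.shape[0] <= max_rows:
        return X, y
    rng = np.random.default_rng(seed)
    # Sample without replacement and keep the original row order.
    idx = np.sort(rng.choice(X.shape[0], size=max_rows, replace=False))
    return X[idx], y[idx]

X_full = np.arange(40000).reshape(20000, 2)
y_full = np.arange(20000) % 3
X_small, y_small = subsample_rows(X_full, y_full, max_rows=5000)
```

The subsampled arrays can then be passed to `fit` in place of the full data. A stratified sample may be preferable when some classes are rare.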

[BUG] `suppress_warnings` not suppressing all warnings

Describe the bug
Using the suppress_warnings flag does not always suppress all warnings especially the following:

* Warning: It is not possible make equal discretization
* Warning: divide by zero encountered in true_divide
* Warning: Mean of empty slice.
* Warning: invalid value encountered in double_scalars
* Warning: invalid value encountered in greater_equal

To Reproduce
None.

Expected behavior
I would assume it would suppress everything unless there is a reasoning behind this?
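
As a user-side workaround, warnings raised through Python's `warnings` module and NumPy floating-point warnings can be silenced around a call, as sketched below. The wrapper `run_quietly` is a hypothetical helper; messages that a library prints directly would not be affected by it.

```python
import warnings
import numpy as np

def run_quietly(func, *args, **kwargs):
    """Call func with Python warnings and NumPy floating-point warnings silenced."""
    with warnings.catch_warnings(), np.errstate(all="ignore"):
        warnings.simplefilter("ignore")
        return func(*args, **kwargs)

# np.mean of an empty array normally warns "Mean of empty slice".
quiet_result = run_quietly(np.mean, np.array([]))
```

The same wrapper could be applied to the fit and extract calls, e.g. `run_quietly(mfe.extract)`.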


Desktop (please complete the following information):

OS: macOS 10.15.6

Additional context
None.

numpy.linalg.LinAlgError: SVD did not converge

Hi :)

I keep getting this bug with regression datasets. With classification datasets, pymfe works fine but with regression it gives me this error (before I was able to use pymfe with regression datasets):

 mfe.fit(X.values, y.values)
  File "/home/fj-silva/.local/lib/python3.6/site-packages/pymfe/mfe.py", line 1034, in fit
    **{**self._custom_args_ft, **kwargs})
  File "/home/fj-silva/.local/lib/python3.6/site-packages/pymfe/_internal.py", line 1209, in process_precomp_groups
    new_precomp_vals = precomp_mtd_callable(**kwargs)  # type: ignore
  File "/home/fj-silva/.local/lib/python3.6/site-packages/pymfe/statistical.py", line 128, in precompute_can_cors
    can_cors = cls._calc_can_cors(N=N, y=y)
  File "/home/fj-silva/.local/lib/python3.6/site-packages/pymfe/statistical.py", line 233, in _calc_can_cors
    n_components=n_components).fit_transform(N, y_bin)
  File "/opt/conda/envs/csw-aii/lib/python3.6/site-packages/sklearn/cross_decomposition/_pls.py", line 517, in fit_transform
    return self.fit(X, y).transform(X, y)
  File "/opt/conda/envs/csw-aii/lib/python3.6/site-packages/sklearn/cross_decomposition/_pls.py", line 333, in fit
    tol=self.tol, norm_y_weights=self.norm_y_weights)
  File "/opt/conda/envs/csw-aii/lib/python3.6/site-packages/sklearn/cross_decomposition/_pls.py", line 80, in _nipals_twoblocks_inner_loop
    Y_pinv = pinv2(Y, check_finite=False, cond=cond_Y)
  File "/opt/conda/envs/csw-aii/lib/python3.6/site-packages/scipy/linalg/basic.py", line 1374, in pinv2
    u, s, vh = decomp_svd.svd(a, full_matrices=False, check_finite=False)
  File "/opt/conda/envs/csw-aii/lib/python3.6/site-packages/scipy/linalg/decomp_svd.py", line 132, in svd
    raise LinAlgError("SVD did not converge")
numpy.linalg.LinAlgError: SVD did not converge

I am trying to extract these features:

features = ['nr_attr', 'nr_inst', 'nr_class', 'attr_to_inst', 'inst_to_attr', 'freq_class', 'nr_cor_attr',
                    'iq_range', 'kurtosis', 'max', 'min', 'var', 'cov', 'eigenvalues', 'skewness',
                    'joint_ent', 'mut_inf', 'eq_num_attr']

And I call mfe like this:

mfe = MFE(features=features)
mfe.fit(X.values, y.values)
ft = mfe.extract()

I already checked whether the dataset contains nan or inf, and I already used StandardScaler to scale the data before fitting mfe. Neither of those worked.

You can reproduce the error with this dataset from openml => https://www.openml.org/d/574
The error seems to happen in cross_decomposition.CCA. Is there any way that I can increase the max_iter parameter?

Thanks :)

PS: I saw that the contributors are from Brazil. I am from Portugal, so if you would like to talk in Portuguese, that is no problem :)

[BUG] AttributeError in igraph.GraphBase.Weighted_Adjacency

Describe the bug
Hi, I tried to run the code on Linux and there were problems importing the library, as shown in the image.


Non-deterministic behavior in the meta-features extraction.

Sometimes the final meta-feature vector is returned with the meta-feature summaries flipped.

Example:

from pymfe.mfe import MFE
from sklearn import datasets

# import some data to play with
iris = datasets.load_iris()
X = iris.data  # use all four features
y = iris.target

a = MFE(groups="general", features=["freq_class"])
a.fit(X, y)
print(a.extract())

The result is sometimes:

(['freq_class.mean', 'freq_class.sd'], [0.33333334, 0.0])
or
(['freq_class.sd', 'freq_class.mean'], [0.0, 0.33333334])

This behavior occurs in module _internal.py, inside of _check_values_in_group function, because of the use of set(map(***)).
https://github.com/ealcobaca/pymfe/blob/master/pymfe/_internal.py#L210
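
Iteration order over a set of strings is not guaranteed to be the same between Python runs, which is consistent with the flipped output above. One order-preserving way to remove duplicates is sketched below; this is a generic illustration, not necessarily the fix adopted in pymfe.

```python
def unique_in_order(values):
    """Drop duplicates while keeping first-seen order (dicts preserve insertion order)."""
    return list(dict.fromkeys(values))

summaries = unique_in_order(["mean", "sd", "mean"])
```

Sorting the values (`sorted(set(values))`) is another deterministic option when the original order does not matter.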

One-hot encoding option with k-1 features rather than k

Is your feature request related to a problem? Please describe.
Yes. The current one-hot encoding option encodes each categorical attribute with k values using k different binary columns, which introduces noise and collinearity due to the "dummy variable trap".

Describe the solution you'd like
Arbitrarily remove the first column of each one-hot encoded attribute. The first value will then be represented by the null vector [0 0 ... 0]^T, and the multi-collinearity issue will vanish.
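
For a single categorical attribute, the proposed k-1 encoding can be sketched with NumPy as follows. The helper `one_hot_drop_first` is hypothetical, and "first" here means the first category in sorted order.

```python
import numpy as np

def one_hot_drop_first(column):
    """Encode a categorical column with k values using k-1 binary columns."""
    categories, codes = np.unique(np.asarray(column), return_inverse=True)
    full = np.eye(len(categories), dtype=int)[codes]  # full k-column encoding
    return full[:, 1:]                                # drop the first category's column

encoded = one_hot_drop_first(["a", "b", "c", "a"])
```

Here the value "a" is represented by the all-zeros row, while "b" and "c" each keep their own column.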

Describe alternatives you've considered
The current one-hot encoding with all k columns for each attribute may be kept as a third encoding option such as "one-hot-full".

Additional context
None.

Unsupervised task asks for target value

Describe the bug
When I try to extract features using MFE.fit() without passing a y, I get the following error:

Traceback (most recent call last):
  File "metastream_clf/default-elec2.py", line 101, in <module>
    unsup_mfe.fit(xsel.values)
TypeError: fit() missing 1 required positional argument: 'y'

This prevents extracting features in an unsupervised task.

To Reproduce
Steps to reproduce the behavior:
As described above, try to fit without passing the second argument (Y).

Expected behavior
As the documentation describes, it also supports unsupervised metafeatures, so this argument should be ignored if not passed.

Desktop (please complete the following information):

OS: Debian 10

New version 0.4.1

Update documentation
Update necessary things to create the new version
Update makefile
Update requirements

New discretization technique

Is your feature request related to a problem? Please describe.
Yes. Some metafeatures do not work well with the current discretization technique (bins with equal frequency).

Describe the solution you'd like
Implement a new discretization technique. The proposed algorithm is:

For each numerical feature f_i do:
  1. Sort the values.
  2. Calculate the difference d_j of each neighbor point, and store its value in an array D.
  3. The new bin range for f_i will be mean(D). (or median(D).)
  4. Discretize f_i.
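
The steps above can be sketched for one numerical feature as follows. This sketch assumes that "bin range" means the bin width and that bins start at the feature's minimum value; both are interpretations, not details stated in the proposal.

```python
import numpy as np

def discretize_by_gap(values, use_median=False):
    """Discretize one feature using the mean (or median) neighbor gap as bin width."""
    values = np.asarray(values, dtype=float)
    if values.size < 2:
        return np.zeros(values.size, dtype=int)
    sorted_vals = np.sort(values)                              # step 1
    gaps = np.diff(sorted_vals)                                # step 2: array D
    width = np.median(gaps) if use_median else np.mean(gaps)   # step 3
    if width <= 0:
        # All values identical: everything falls in one bin.
        return np.zeros(values.size, dtype=int)
    # Step 4: bin index counted from the minimum value.
    return np.floor((values - sorted_vals[0]) / width).astype(int)

bins_mean = discretize_by_gap([0.0, 2.0, 4.0, 6.0])
bins_median = discretize_by_gap([0.0, 1.0, 2.0, 4.0], use_median=True)
```

With a small width relative to the feature's range, this can produce many bins, some of them empty, which differs from the equal-frequency behavior.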

Describe alternatives you've considered
Alternative methods may be implemented in the future, but they are not as necessary as this first alternative to the current method.

Additional context
None.

Choose a default (non-None) value for "ft_attr_conc"'s 'max_attr_num' parameter

Is your feature request related to a problem? Please describe.
Yes. The 'ft_attr_conc' method takes an excessively long time for datasets with large dimensions.

Describe the solution you'd like
A possible solution is to define a (non-None) default value for the 'max_attr_num' parameter of the method. This parameter works as follows: if the dataset has a dimension higher than 'max_attr_num' value, then 'max_attr_num' attributes are randomly sampled, and the feature is calculated only with the sampled attributes.

The default value 'max_attr_num' must be very low (10 - 20), since the feature computational cost grows with extremely high complexity.

Describe alternatives you've considered
A more complex solution would be to create a method decorator to mark which functions are costly, and create a MFE user parameter to choose if only the "cheap" (non-decorated) methods are extracted, or if all ("cheap" + "expensive" (decorated)) meta-features should be extracted.

This solution also enables more complex "package engineering", since we can also use the decorator in a gray-scale cost fashion (e.g. use the decorator with an integer parameter which defines the "cost" of the method, 1, 2, ..., and non-decorated methods are assumed to have minimal cost, 0) rather than a binary cost fashion ("cheap" and "expensive").
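
The decorator idea could look like the sketch below. All names here (`cost`, `extraction_cost`, `select_by_cost`, and the two example functions) are hypothetical and only illustrate the gray-scale cost scheme described above.

```python
def cost(level):
    """Decorator that tags a meta-feature function with an integer cost."""
    def decorator(func):
        func.extraction_cost = level
        return func
    return decorator

def select_by_cost(funcs, max_cost):
    """Keep functions whose cost is at most max_cost (undecorated ones cost 0)."""
    return [f for f in funcs if getattr(f, "extraction_cost", 0) <= max_cost]

def ft_cheap(X):
    return len(X)

@cost(2)
def ft_expensive(X):
    return sum(len(row) for row in X)

cheap_only = select_by_cost([ft_cheap, ft_expensive], max_cost=0)
everything = select_by_cost([ft_cheap, ft_expensive], max_cost=2)
```

A user-facing parameter could then map to `max_cost`, with 0 meaning "cheap only".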

Personally, I don't think this alternative solution is worth the effort, since it is unlikely that expensive meta-feature extraction methods will be implemented that often in this package.

Additional context
None.

[BUG] Unable to run MFE for datasets of more than ~500 features

Describe the bug
When running MFE with group general and a dataset with more than (around) 500 features, a RecursionError: maximum recursion depth exceeded while calling a Python object error is thrown.

To Reproduce
Steps to reproduce the behavior:

mfe = MFE(groups=["general"])
mfe.fit(X, y)  # where X has more than 500 features

Expected behavior
Generate the general meta-features.

Screenshots
N/A

Desktop (please complete the following information):

OS: macOS
Version: Ventura 13.4

Additional context
The stack trace is as follows:

  File "[...]/lib/python3.8/site-packages/patsy/desc.py", line 400, in eval
    result = self._evaluators[key](self, tree)
  File "[...]/lib/python3.8/site-packages/patsy/desc.py", line 233, in _eval_binary_plus
    left_expr = evaluator.eval(tree.args[0])
  File "[...]/lib/python3.8/site-packages/patsy/desc.py", line 400, in eval
    result = self._evaluators[key](self, tree)
  File "[...]/lib/python3.8/site-packages/patsy/desc.py", line 233, in _eval_binary_plus
    left_expr = evaluator.eval(tree.args[0])
  File "[...]/lib/python3.8/site-packages/patsy/desc.py", line 394, in eval
    assert isinstance(tree, ParseNode)
RecursionError: maximum recursion depth exceeded while calling a Python object

The failure comes from patsy and seems to be related to what is mentioned in this issue in their repo. It is not fixed and they do not intend to fix it, as formulaic, the successor of patsy, already has this solved. My suggestion here would be to upgrade to formulaic, as patsy is no longer under active development (stated in their readme).

Binarization

The binarization method is doing some strange transformation. I tested on 02_postop_cat_only.csv dataset.

pymfe/pymfe/_internal.py

Line 1136 in 3c63045

return np.asarray(patsy.dmatrix(formula, named_data))

Merge with ts-pymfe

Is your feature request related to a problem? Please describe.
The pymfe expansion for time-series meta-feature extraction is already available.

Describe the solution you'd like
Merge ts-pymfe into pymfe.

Methods for supervised/unsupervised and for time-series meta-feature extraction can be put into different classes (e.g., MFE and TSMFE, respectively).

Describe alternatives you've considered
None.

Additional context

An architecture review must be done in the pymfe package to enable easy merging with ts-pymfe, and also to prevent code duplication.
ts-pymfe is lacking unit tests, so these must also be written.
