danielhomola / mifs

Parallelized Mutual Information based Feature Selection module.

License: BSD 3-Clause "New" or "Revised" License

Python 97.30% Makefile 2.70%

mifs's People

Contributors

danielhomola, joostjm, phs-sakshi, subrag, sumansaha66, swapnilkura-tal, tagomatech

mifs's Issues

import fails using scikit-learn v 0.23.2

When importing mifs, there is an error relating to the import of Parallel and delayed from sklearn.externals.joblib. It appears that sklearn.externals.joblib has been moved or is no longer part of sklearn v0.23.2.
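
A common way to cope with a module that has moved between releases is to try the new location first and fall back to the old one. The sketch below shows that pattern in isolation. It is not a patch taken from the mifs repository, and the helper name `import_first_available` is hypothetical.

```python
import importlib


def import_first_available(module_names):
    """Return the first module in module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # This location does not exist in the installed version; try the next.
            continue
    raise ImportError("none of these modules could be imported: %r" % (module_names,))


# Hypothetical usage inside mifs (requires joblib or an old scikit-learn):
# joblib = import_first_available(["joblib", "sklearn.externals.joblib"])
# Parallel, delayed = joblib.Parallel, joblib.delayed
```

Trying the standalone `joblib` package first means the deprecated `sklearn.externals` path is only touched on installations that still ship it.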

IndexError: index 10 is out of bounds for axis 0 with size 10

I get the following error message with my X sized 10 by 22,206. It does not happen using your artificial example data:

X, y = make_classification(n_samples=s, n_features=f, n_informative=i,

Does anyone have an idea where it comes from?

MIFS = mifs.MutualInformationFeatureSelector(method='MRMR', verbose=2, k=3, n_jobs=39)
MIFS.fit(X, y)
Auto selected feature #1 : 97, MRMR : 1.2456349206349207
Auto selected feature #2 : 105, MRMR : 1.174697546314963
Auto selected feature #3 : 2221, MRMR : 1.1440936480371255
Auto selected feature #4 : 2638, MRMR : 1.1159469926404155
Auto selected feature #5 : 2648, MRMR : 1.075870166881065
Auto selected feature #6 : 108, MRMR : 1.0689369414517065
Auto selected feature #7 : 4244, MRMR : 1.0607923083006399
Auto selected feature #8 : 1760, MRMR : 1.0263351177165052
Auto selected feature #9 : 74, MRMR : 1.0321992176732293
Auto selected feature #10 : 1540, MRMR : 1.0072073023785721
Auto selected feature #11 : 1931, MRMR : 1.0046566589987855
Traceback (most recent call last):
  File "/snap/pycharm-professional/213/plugins/python/helpers/pydev/pydevd.py", line 1448, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/snap/pycharm-professional/213/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/saskra/PycharmProjects/mifs/examples/examples.py", line 67, in <module>
    MIFS.fit(X, y)
  File "/home/saskra/PycharmProjects/mifs/mifs/mifs.py", line 196, in fit
    feature_mi_matrix[s, F] = mi.get_mi_vector(self, F, S[-1])
IndexError: index 10 is out of bounds for axis 0 with size 10
python-BaseException

Error calling fit function, classes of type String (Y)

Hello,
I'm trying to do feature selection for some Microarray data using this module but I'm failing to do so. My target Y contains values of type string (S3) which represent class names. The execution of

feat_selector = mifs.MutualInformationFeatureSelector()
feat_selector.fit(X, Y)

fails with the following error :

warnings.warn("Variables are collinear.")
Traceback (most recent call last):
  File "main.py", line 69, in <module>
    feat_selector.fit(X, Y.tolist())
  File "/usr/lib/python2.7/site-packages/mifs-0.0.1.dev0-py2.7.egg/mifs/mifs.py", line 149, in fit
    return self._fit(X, y)
  File "/usr/lib/python2.7/site-packages/mifs-0.0.1.dev0-py2.7.egg/mifs/mifs.py", line 193, in _fit
    self.X, y = self._check_params(X, y)
  File "/usr/lib/python2.7/site-packages/mifs-0.0.1.dev0-py2.7.egg/mifs/mifs.py", line 308, in _check_params
    if self.categorical and np.any(self.k > np.bincount(y)):
TypeError: Cannot cast array data from dtype('S3') to dtype('int64') according to the rule 'safe'

I don't know a lot about these methods, does my Y have to be an Integer?
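
The traceback shows `np.bincount(y)` being called on the labels, and `np.bincount` only accepts non-negative integers, so string class names would need to be mapped to integer codes before calling `fit`. The sketch below does this in plain Python. `encode_labels` is a hypothetical helper and the class names are made up for illustration; scikit-learn's `LabelEncoder` performs the same kind of mapping.

```python
def encode_labels(labels):
    """Map arbitrary hashable class labels to integer codes 0..n_classes-1."""
    classes = sorted(set(labels))
    code_of = {label: code for code, label in enumerate(classes)}
    codes = [code_of[label] for label in labels]
    return codes, classes


# Made-up class names standing in for the string targets.
codes, classes = encode_labels(["ALL", "AML", "ALL", "AML", "AML"])
# codes can be used as y; classes[codes[i]] recovers the original name.
```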

Question about using `knn.radius_neighbors`

Hi Daniel:

Thank you very much for reading this. This is not quite an issue but a question. I found that you pass a list as the radius to the knn.radius_neighbors function, but I cannot find any documentation for this.

The doc of this function only says that it accepts a float as radius. Can you please help me understand this code?

# find the distance of the kth in-class point
for c in classes:
    mask = np.where(y == c)[0]
    knn.fit(x[mask, :])
    d2k[mask] = knn.kneighbors()[0][:, -1]

# find the number of points within the distance of the kth in-class point
knn.fit(x)
m = knn.radius_neighbors(radius=d2k, return_distance=False)
m = [i.shape[0] for i in m]

Thanks a lot.
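
To make the intent of the quoted snippet concrete: `d2k` holds one distance per sample (the distance to its k-th nearest neighbour within its own class), and the last step counts, for each sample, how many points of the full data set lie within that sample's own distance. Whether `radius_neighbors` officially supports an array-valued `radius` is exactly the open question here, and this sketch does not settle it. It only illustrates the per-sample-radius computation by brute force on 1-D data. The function name `count_within_radius` is hypothetical, and excluding the point itself and using `<=` are assumptions.

```python
def count_within_radius(points, radii):
    """For each point i, count the other points within radii[i] of it."""
    counts = []
    for i, (p, r) in enumerate(zip(points, radii)):
        # Assumption: a point is not counted as its own neighbour.
        n = sum(1 for j, q in enumerate(points) if j != i and abs(p - q) <= r)
        counts.append(n)
    return counts


points = [0.0, 1.0, 3.0]
radii = [1.0, 1.0, 0.5]  # one radius per point, playing the role of d2k
m = count_within_radius(points, radii)
```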

Sparse data and memory

I'd like to try the techniques but have sparse text features. The current code doesn't run with sparse matrices. How hard would it be to change so it did?

If I do try to limit to less than 1000 features and make dense matrices the memory still seems to go above 20G even with 1 thread (num_cores forced to 1). Is this an intractable limitation of the techniques or is this unexpected?

Thanks.

blog post link is not working

hi

I am interested in your work, but the blog post link in the README seems to be broken.

I hope to learn more from your blog

thanks

ValueError: Expected 2D array, got 1D array instead: ...

Hi @danielhomola ,
Thanks for your code! When I was trying to use 'JMI' with 'categorical=False', I got the following error:

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
in ()
1 start = time.time()
2 feat_selector = mifs.MutualInformationFeatureSelector(categorical=False, verbose=2, method='JMI')
----> 3 feat_selector.fit(X, y_.reshape(-1,1))
4 print("{0:.3f} s".format(time.time()-start))

D:\Anaconda3\envs\Py2\lib\site-packages\mifs-0.0.1.dev0-py2.7.egg\mifs\mifs.pyc in fit(self, X, y)
147 self.n_jobs = NUM_CORES - self.n_jobs
148
--> 149 return self._fit(X, y)
150
151

D:\Anaconda3\envs\Py2\lib\site-packages\mifs-0.0.1.dev0-py2.7.egg\mifs\mifs.pyc in _fit(self, X, y)
191
192 def _fit(self, X, y):
--> 193 self.X, y = self._check_params(X, y)
194 n, p = X.shape
195 self.y = y.reshape((n, 1))

D:\Anaconda3\envs\Py2\lib\site-packages\mifs-0.0.1.dev0-py2.7.egg\mifs\mifs.pyc in _check_params(self, X, y)
294 ss = StandardScaler()
295 X = ss.fit_transform(X)
--> 296 y = ss.fit_transform(y)
297
298 # sanity checks

D:\Anaconda3\envs\Py2\lib\site-packages\sklearn\base.pyc in fit_transform(self, X, y, **fit_params)
516 if y is None:
517 # fit method of arity 1 (unsupervised transformation)
--> 518 return self.fit(X, **fit_params).transform(X)
519 else:
520 # fit method of arity 2 (supervised transformation)

D:\Anaconda3\envs\Py2\lib\site-packages\sklearn\preprocessing\data.pyc in fit(self, X, y)
588 # Reset internal state before fitting
589 self._reset()
--> 590 return self.partial_fit(X, y)
591
592 def partial_fit(self, X, y=None):

D:\Anaconda3\envs\Py2\lib\site-packages\sklearn\preprocessing\data.pyc in partial_fit(self, X, y)
610 """
611 X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
--> 612 warn_on_dtype=True, estimator=self, dtype=FLOAT_DTYPES)
613
614 # Even in the case of with_mean=False, we update the mean anyway

D:\Anaconda3\envs\Py2\lib\site-packages\sklearn\utils\validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
408 "Reshape your data either using array.reshape(-1, 1) if "
409 "your data has a single feature or array.reshape(1, -1) "
--> 410 "if it contains a single sample.".format(array))
411 array = np.atleast_2d(array)
412 # To ensure that array flags are maintained

ValueError: Expected 2D array, got 1D array instead:
array=[ 0.8762763 0.98324755 0.69945964 0.93662919 0.02916694 ... 0.53569545 ]
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

I believe the array I input to the function is 2D, but it still outputs such error. If I change 'categorical=False' to 'categorical=True', then everything works well. Can you give me some help?
Thanks a lot!

ValueError: Expected 2D array, got 1D array instead

Using continuous variables, I took a simple input matrix X and target variable y. It seems to run into a problem in sklearn's preprocessing even though the matrix dimensions are set correctly:

  File "run.py", line 29, in <module>
    feat_selector.fit(X, y)
  File "/home/charles/corex/lib/python2.7/site-packages/mifs-0.0.1.dev0-py2.7.egg/mifs/mifs.py", line 149, in fit
    return self._fit(X, y)
  File "/home/charles/corex/lib/python2.7/site-packages/mifs-0.0.1.dev0-py2.7.egg/mifs/mifs.py", line 193, in _fit
    self.X, y = self._check_params(X, y)
  File "/home/charles/corex/lib/python2.7/site-packages/mifs-0.0.1.dev0-py2.7.egg/mifs/mifs.py", line 296, in _check_params
    y = ss.fit_transform(y)
  File "/home/charles/corex/lib/python2.7/site-packages/sklearn/base.py", line 518, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/home/charles/corex/lib/python2.7/site-packages/sklearn/preprocessing/data.py", line 590, in fit
    return self.partial_fit(X, y)
  File "/home/charles/corex/lib/python2.7/site-packages/sklearn/preprocessing/data.py", line 612, in partial_fit
    warn_on_dtype=True, estimator=self, dtype=FLOAT_DTYPES)
  File "/home/charles/corex/lib/python2.7/site-packages/sklearn/utils/validation.py", line 410, in check_array
    "if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[ 1.  1.  1. ...,  0.  0.  0.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

unable to install

Hello Daniel,
I was trying to install this module to evaluate the feature selection algorithm described. However, when I tried to install it, I got an error saying that a command needs to be supplied when running the setup file.

Here is the method I used
python setup.py

Are there any commands that need to be passed with this?

Thanks!

Error when importing SelectorMixin

Hi,

First of all thanks for this amazing package! It literally saved my life!!

When I tried to import the package it failed with the following error:
ModuleNotFoundError: No module named 'sklearn.feature_selection.base'

I changed the following line of code in the mifs.py file for myself and that solved the issue for me. I am a bit new to github and programming so maybe I did something else wrong that created the problem for me. Nevertheless I thought it could not hurt to let you know ;)

from sklearn.feature_selection.base import SelectorMixin
To:
from sklearn.feature_selection import SelectorMixin

HELP!! All-NaN slice encountered

Using continuous variables, I encountered an error reporting an all-NaN slice.

File "", line 1, in
feat_selector.fit(X_train, y_train)

File "/home/lemma/miniconda2/lib/python2.7/site-packages/mifs-0.0.1.dev0-py2.7.egg/mifs/mifs.py", line 149, in fit
return self._fit(X, y)
File "/home/lemma/miniconda2/lib/python2.7/site-packages/mifs-0.0.1.dev0-py2.7.egg/mifs/mifs.py", line 223, in _fit
S, F = self._add_remove(S, F, bn.nanargmax(xy_MI))

ValueError: All-NaN slice encountered.
My data looks like:
[image]

Not sure what's going on, although the code examples provided were running perfectly.

Are you sure y is categorical? It has more than 5 levels.

After I changed my OS from Mint 17 to 18.1, I started getting this error. My framework uses mutual information for feature selection on the training dataset for malware analysis. I checked all dependencies for sklearn and all are satisfied with the right versions. I use sklearn 0.18.1. I get 11 features selected by MI, but then I get this:
Indices of selected features : 29,32,333,360,18,25,330,0,317,31,24
Saved to disk file named Selected_features_indexes.csv in current folder
Selected features. OK..........
Fetching relevant/selected features.....................
Fetched relevant/selected features.....................
/usr/local/lib/python2.7/dist-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
Reading training data....Wait
Read training data having dimestions =(4832, 12)
Preprocessing and spliting data using cross valiation
/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py:429: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
warnings.warn(msg, _DataConversionWarning)
Encoding class labels......................
['Backdoor', 'CleanWare', 'DangerousObject', 'Email-Worm', 'Hoax', 'Net-Worm', 'Packed', 'Trojan', 'Trojan-Downloader', 'Trojan-Dropper', 'Trojan-FakeAV', 'Trojan-GameThief', 'Trojan-PSW', 'Trojan-Ransom', 'Trojan-Spy', 'Virus', 'Worm']
total no of classes in dataset=17
Training dataset has samples=(3381, 11)
Test dataset has samples=(1450, 11)
Training the MLP classifier.....wait
Traceback (most recent call last):
File "trainMLP.py", line 54, in
my_mlp.MLP_classify( )
File "/home/android/Desktop/framework/TrainMLP ok/MLP_classify.py", line 81, in MLP_classify
MLPclf = MLPClassifier(algorithm='sgd', alpha=1e-5, hidden_layer_sizes=(500,), random_state=1, learning_rate='adaptive', max_iter =2500, verbose=False)
TypeError: __init__() got an unexpected keyword argument 'algorithm'

AttributeError: 'MutualInformationFeatureSelector' object has no attribute 'support_'

New issue found...

$ python3 examples.py 
/.../.local/lib/python3.6/site-packages/sklearn/externals/joblib/__init__.py:15: DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
  warnings.warn(msg, category=DeprecationWarning)
Auto selected feature #1 : 11, JMI : 0.0684383100490189
Auto selected feature #2 : 10, JMI : 0.2072219896820795
Auto selected feature #3 : 13, JMI : 0.14946908422839966
Auto selected feature #4 : 8, JMI : 0.12245840290111998
Auto selected feature #5 : 14, JMI : 0.12245840290111998
Auto selected feature #6 : 4, JMI : 0.1000503773473076
Auto selected feature #7 : 0, JMI : 0.1000503773473076
Auto selected feature #8 : 6, JMI : 0.09800851343094807
Auto selected feature #9 : 5, JMI : 0.093447512891502
Auto selected feature #10 : 2, JMI : 0.093447512891502
Auto selected feature #11 : 7, JMI : 0.057598055546301374
Auto selected feature #12 : 9, JMI : 0.057598055546301374
Auto selected feature #13 : 3, JMI : 0.042444329907475264
Auto selected feature #14 : 1, JMI : 0.04044575359988212
Auto selected feature #15 : 64, JMI : 0.03845355583264709
Auto selected feature #16 : 88, JMI : 0.03793724882524119
Auto selected feature #17 : 83, JMI : 0.03793724882524119
Auto selected feature #18 : 47, JMI : 0.03793724882524119
Traceback (most recent call last):
  File "examples.py", line 50, in <module>
    sens, prec = check_selection(np.where(MIFS.support_)[0], i, r)
AttributeError: 'MutualInformationFeatureSelector' object has no attribute 'support_'

Problem in threadpoolctl: "NoneType object has no attribute split"

Hi
Thanks for your code, appreciated.
I am trying to use it with Python 3.8.5, but unfortunately it gives me the error in the subject. This happens when calling the get_config().split() method in threadpoolctl.
I have installed a conda environment with Python 3.7.2 (I read somewhere that it could be solved by downgrading the version), but I get the same error.
Any ideas what could be the source?
I attach the whole error given, in case it helps.
Many thanks


AttributeError Traceback (most recent call last)
Input In [8], in <cell line: 5>()
2 feat_selector = MutualInformationFeatureSelector()
4 # find all relevant features
----> 5 feat_selector.fit(X, y)
7 # check selected features
8 feat_selector._support_mask

Input In [5], in MutualInformationFeatureSelector.fit(self, X, y)
171 S_mi = []
173 # ---------------------------------------------------------------------
174 # FIND FIRST FEATURE
175 # ---------------------------------------------------------------------
--> 177 xy_MI = np.array(get_first_mi_vector(self, self.k))
179 # choose the best, add it to S, remove it from F
180 S, F = self._add_remove(S, F, bn.nanargmax(xy_MI))

Input In [7], in get_first_mi_vector(MI_FS, k)
52 """
53 Calculates the Mututal Information between each feature in X and y.
54
55 This function is for when |S| = 0. We select the first feautre in S.
56 """
57 n, p = MI_FS.X.shape
---> 58 MIs = Parallel(n_jobs=MI_FS.n_jobs)(delayed(_get_first_mi)(i, k, MI_FS)
59 for i in range(p))
60 print("MIs: ", MIs)
61 return MIs

File d:\python38\lib\site-packages\joblib\parallel.py:1863, in Parallel.__call__(self, iterable)
1861 output = self._get_sequential_output(iterable)
1862 next(output)
-> 1863 return output if self.return_generator else list(output)
1865 # Let's create an ID that uniquely identifies the current call. If the
1866 # call is interrupted early and that the same instance is immediately
1867 # re-used, this id will be used to prevent workers that were
1868 # concurrently finalizing a task from the previous call to run the
1869 # callback.
1870 with self._lock:

File d:\python38\lib\site-packages\joblib\parallel.py:1792, in Parallel._get_sequential_output(self, iterable)
1790 self.n_dispatched_batches += 1
1791 self.n_dispatched_tasks += 1
-> 1792 res = func(*args, **kwargs)
1793 self.n_completed_tasks += 1
1794 self.print_progress()

Input In [7], in _get_first_mi(i, k, MI_FS)
66 if MI_FS.categorical:
67 x = MI_FS.X[:, i].reshape((n, 1))
---> 68 MI = _mi_dc(x, MI_FS.y, k)
69 else:
70 vars = (MI_FS.X[:, i].reshape((n, 1)), MI_FS.y)

Input In [7], in _mi_dc(x, y, k)
109 mask = np.where(y == c)[0]
110 knn.fit(x[mask, :])
--> 111 d2k[mask] = knn.kneighbors()[0][:, -1]
113 # find the number of points within the distance of the kth in-class point
114 knn.fit(x)

File d:\python38\lib\site-packages\sklearn\neighbors\_base.py:822, in KNeighborsMixin.kneighbors(self, X, n_neighbors, return_distance)
815 use_pairwise_distances_reductions = (
816 self._fit_method == "brute"
817 and ArgKmin.is_usable_for(
818 X if X is not None else self._fit_X, self._fit_X, self.effective_metric_
819 )
820 )
821 if use_pairwise_distances_reductions:
--> 822 results = ArgKmin.compute(
823 X=X,
824 Y=self._fit_X,
825 k=n_neighbors,
826 metric=self.effective_metric_,
827 metric_kwargs=self.effective_metric_params_,
828 strategy="auto",
829 return_distance=return_distance,
830 )
832 elif (
833 self._fit_method == "brute" and self.metric == "precomputed" and issparse(X)
834 ):
835 results = _kneighbors_from_graph(
836 X, n_neighbors=n_neighbors, return_distance=return_distance
837 )

File d:\python38\lib\site-packages\sklearn\metrics\_pairwise_distances_reduction\_dispatcher.py:258, in ArgKmin.compute(cls, X, Y, k, metric, chunk_size, metric_kwargs, strategy, return_distance)
177 """Compute the argkmin reduction.
178
179 Parameters
(...)
255 returns.
256 """
257 if X.dtype == Y.dtype == np.float64:
--> 258 return ArgKmin64.compute(
259 X=X,
260 Y=Y,
261 k=k,
262 metric=metric,
263 chunk_size=chunk_size,
264 metric_kwargs=metric_kwargs,
265 strategy=strategy,
266 return_distance=return_distance,
267 )
269 if X.dtype == Y.dtype == np.float32:
270 return ArgKmin32.compute(
271 X=X,
272 Y=Y,
(...)
278 return_distance=return_distance,
279 )

File sklearn\metrics\_pairwise_distances_reduction\_argkmin.pyx:90, in sklearn.metrics._pairwise_distances_reduction._argkmin.ArgKmin64.compute()

File d:\python38\lib\site-packages\sklearn\utils\fixes.py:72, in threadpool_limits(limits, user_api)
70 return controller.limit(limits=limits, user_api=user_api)
71 else:
---> 72 return threadpoolctl.threadpool_limits(limits=limits, user_api=user_api)

File d:\python38\lib\site-packages\threadpoolctl.py:171, in threadpool_limits.__init__(self, limits, user_api)
167 def __init__(self, limits=None, user_api=None):
168 self._limits, self._user_api, self._prefixes = \
169 self._check_params(limits, user_api)
--> 171 self._original_info = self._set_threadpool_limits()

File d:\python38\lib\site-packages\threadpoolctl.py:268, in threadpool_limits._set_threadpool_limits(self)
265 if self._limits is None:
266 return None
--> 268 modules = _ThreadpoolInfo(prefixes=self._prefixes,
269 user_api=self._user_api)
270 for module in modules:
271 # self._limits is a dict {key: num_threads} where key is either
272 # a prefix or a user_api. If a module matches both, the limit
273 # corresponding to the prefix is chosed.
274 if module.prefix in self._limits:

File d:\python38\lib\site-packages\threadpoolctl.py:340, in _ThreadpoolInfo.__init__(self, user_api, prefixes, modules)
337 self.user_api = [] if user_api is None else user_api
339 self.modules = []
--> 340 self._load_modules()
341 self._warn_if_incompatible_openmp()
342 else:

File d:\python38\lib\site-packages\threadpoolctl.py:373, in _ThreadpoolInfo._load_modules(self)
371 self._find_modules_with_dyld()
372 elif sys.platform == "win32":
--> 373 self._find_modules_with_enum_process_module_ex()
374 else:
375 self._find_modules_with_dl_iterate_phdr()

File d:\python38\lib\site-packages\threadpoolctl.py:485, in _ThreadpoolInfo._find_modules_with_enum_process_module_ex(self)
482 filepath = buf.value
484 # Store the module if it is supported and selected
--> 485 self._make_module_from_path(filepath)
486 finally:
487 kernel_32.CloseHandle(h_process)

File d:\python38\lib\site-packages\threadpoolctl.py:515, in _ThreadpoolInfo._make_module_from_path(self, filepath)
513 if prefix in self.prefixes or user_api in self.user_api:
514 module_class = globals()[module_class]
--> 515 module = module_class(filepath, prefix, user_api, internal_api)
516 self.modules.append(module)

File d:\python38\lib\site-packages\threadpoolctl.py:606, in _Module.__init__(self, filepath, prefix, user_api, internal_api)
604 self.internal_api = internal_api
605 self._dynlib = ctypes.CDLL(filepath, mode=_RTLD_NOLOAD)
--> 606 self.version = self.get_version()
607 self.num_threads = self.get_num_threads()
608 self._get_extra_info()

File d:\python38\lib\site-packages\threadpoolctl.py:646, in _OpenBLASModule.get_version(self)
643 get_config = getattr(self._dynlib, "openblas_get_config",
644 lambda: None)
645 get_config.restype = ctypes.c_char_p
--> 646 config = get_config().split()
647 if config[0] == b"OpenBLAS":
648 return config[1].decode("utf-8")

AttributeError: 'NoneType' object has no attribute 'split'
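
The last frame shows the failure mode: `get_config()` can return `None`, for instance through the `lambda: None` fallback visible on line 644 of the traceback, and `.split()` is then called on that `None`. The sketch below reproduces the parsing step in isolation with a guard for that case. It illustrates the traceback only and is not threadpoolctl's actual fix; `parse_openblas_version` is a hypothetical name and the sample config string is made up.

```python
def parse_openblas_version(config):
    """Extract the version from an OpenBLAS config byte string, or None."""
    if config is None:
        # Nothing to parse: avoid calling .split() on None.
        return None
    parts = config.split()
    if len(parts) >= 2 and parts[0] == b"OpenBLAS":
        return parts[1].decode("utf-8")
    return None


# Made-up config string in the shape the quoted code expects.
version = parse_openblas_version(b"OpenBLAS 0.3.21 DYNAMIC_ARCH")
missing = parse_openblas_version(None)
```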

Error using JMI. MI is always negative?!

Hi Daniel;
Thanks for sharing the code!
I was trying to work with the Pima Indians Diabetes Data Set as example but I get the "ValueError: All-NaN slice encountered".
Digging a bit into the code, I found that the MI calculated within the mi_dc function in mi.py is always negative, so I always get NaN values, which makes any calculation impossible.
Any idea?

Here is the script I am using for the example

import numpy as np
import urllib
import mifs

# Python 2 API; on Python 3 the equivalent call is urllib.request.urlopen
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
raw_data = urllib.urlopen(url)
dataset = np.loadtxt(raw_data, delimiter=",")
X = dataset[:, 0:8]
y = dataset[:, 8]
MIFS = mifs.MutualInformationFeatureSelector(method='JMI', verbose=0, n_features=8)
y_int = y.astype(np.int64)
MIFS.fit(X, y_int)

AttributeError: 'module' object has no attribute 'MutualInformationFeatureSelector'

Traceback (most recent call last):
File "mifs.py", line 19, in
class MutualInformationFeatureSelector(object):
File "mifs.py", line 86, in MutualInformationFeatureSelector
import mifs
File "/home/sathya/Downloads/mifs-master/mifs/mifs.py", line 19, in
class MutualInformationFeatureSelector(object):
File "/home/sathya/Downloads/mifs-master/mifs/mifs.py", line 93, in MutualInformationFeatureSelector
feat_selector = mifs.MutualInformationFeatureSelector()
AttributeError: 'module' object has no attribute 'MutualInformationFeatureSelector'

I got this error when I ran it.
If I change it to "from mifs import MutualInformationFeatureSelector", I get an import error: cannot import name MutualInformationFeatureSelector

Information changes (increases/decreases) upon adding duplicate features

Hello authors,

I was analyzing the performance of the mifs feature selection and stumbled on a strange property. My features are continuous, my target is categorical. In my design matrix I accidentally used one pair of duplicate features, but to my surprise they were often both selected by the algorithm (I used JMIM). I managed to trace it to a property of the function _mi_dc(x, y, k). For features 1, 2, 3, with 1, 2 being exactly the same I obtained the following results of the _mi_dc (data attached):

  • [1] 0.302
  • [2] 0.302 - same, as expected
  • [1, 2] 0.400 - strange, 1 and 2 are duplicates there is no extra information, but information increased
  • [3] 0.389
  • [1, 3] 0.447 - increase, as expected
  • [2, 3] 0.447 - same increase, as expected
  • [1, 2, 3] 0.423 - very strange, information decreased when adding a redundant feature

Unless I don't understand some aspect of the calculation, it looks like adding a strictly redundant feature confuses the MI calculation. Could this be rectified? Or maybe there are some alternative options of calculating the joint MI? Of course one can detect that and skip exact duplicates, but it makes me a little hesitant about using the algorithm with many closely related features.

All my best,
Andrzej

mifs_data.txt
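
The workaround mentioned above, detecting exact duplicates before running the selector, can be sketched in a few lines. This is a plain-Python illustration with a hypothetical helper name `find_duplicate_columns` and a made-up toy matrix. It only catches columns that are identical value for value and does nothing for features that are merely highly correlated.

```python
def find_duplicate_columns(rows):
    """Return (first_index, duplicate_index) pairs for identical columns."""
    first_seen = {}
    duplicates = []
    for j, column in enumerate(zip(*rows)):
        if column in first_seen:
            duplicates.append((first_seen[column], j))
        else:
            first_seen[column] = j
    return duplicates


# Toy design matrix: columns 0 and 1 are identical, column 2 differs.
X = [
    [0.5, 0.5, 1.0],
    [0.1, 0.1, 2.0],
    [0.9, 0.9, 3.0],
]
pairs = find_duplicate_columns(X)
dropped = {dup for _, dup in pairs}
keep = [j for j in range(len(X[0])) if j not in dropped]
```

The `keep` list can then be used to select columns before calling `fit`, so that the selector never sees both copies.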

Fitting time too long

I am trying to fit data of dimension around 600*11 with a continuous y (a regression problem). However, the fit seems to take forever. I don't know what is going wrong, since I believe I have set up the code correctly.

Problem with MRMR method

Hi! I'm having some trouble with the MRMR method when testing this code:

X = np.random.random((9000,20))
y = np.zeros(9000, dtype=int)
y[np.random.random(9000)>0.5] = 1

# define MI_FS feature selection method
feat_selector = mifs.MutualInformationFeatureSelector(method='MRMR', categorical=True, n_features='auto')

# find all relevant features
feat_selector.fit(X, y)

# check selected features
print(feat_selector.support_)

# check ranking of features
print(feat_selector.ranking_)

# call transform() on X to filter it down to selected features
X_filtered = feat_selector.transform(X)
print(X_filtered)

I get this error:

Traceback (most recent call last):
  File "test_mifs.py", line 17, in <module>
    feat_selector.fit(X, y)
  File "/home/martin/Repositories/svm/lib/mifs.py", line 137, in fit
    return self._fit(X, y)
  File "/home/martin/Repositories/svm/lib/mifs.py", line 237, in _fit
    selected = F[bn.nanargmax(MRMR)]
  File "reduce.pyx", line 2907, in reduce.nanargmax (bottleneck/src/auto_pyx/reduce.c:25633)
  File "reduce.pyx", line 3552, in reduce.reducer (bottleneck/src/auto_pyx/reduce.c:31009)
  File "reduce.pyx", line 2943, in reduce.nanargmax_all_float64 (bottleneck/src/auto_pyx/reduce.c:25949)
ValueError: All-NaN slice encountered

The other two methods don't have any problem. I'm working in an anaconda2 environment. I would appreciate your help!

Indexing for subsequent feature selection

Hi Daniel,

Thanks for sharing your code! I'm trying to use MRMR and I'm running into some problems. Using continuous features, the mutual information keeps producing negative values. I was able to get around this problem by using another MI implementation (see aside below) - however, the code used for selecting subsequent features does not match up with what I expected based on Peng's paper (see incremental algorithm, equation (7)). Here is the relevant code snippet from mifs.py lines 222-239:

    # ----------------------------------------------------------------------
    # FIND SUBSEQUENT FEATURES
    # ----------------------------------------------------------------------

    while len(S) < self.n_features:
        # loop through the remaining unselected features and calculate MI
        s = len(S) - 1
        feature_mi_matrix[s, F] = mi.get_mi_vector(self, F, s)

        # make decision based on the chosen FS algorithm
        fmm = feature_mi_matrix[:len(S),F]
        if self.method == 'JMI':
            selected = F[bn.nanargmax(bn.nansum(fmm, axis=0))]
        elif self.method == 'JMIM':
            selected = F[bn.nanargmax(bn.nanmin(fmm, axis=0))]
        elif self.method == 'MRMR':
            MRMR = xy_MI[F] - bn.nanmean(fmm, axis=0)
            selected = F[bn.nanargmax(MRMR)]

        # record the JMIM of the newly selected feature and add it to S
        S_mi.append(bn.nanmax(bn.nanmin(fmm, axis=0)))
        S, F = self._add_remove(S, F, selected)

The s that is passed to mi.get_mi_vector() should be the index of the previously selected feature according to the comment in get_mi_vector, but in this case s is just the number of currently selected features - 1 (which is meaningless when you use it to index X). Shouldn't the arguments be mi.get_mi_vector(self, F, S[-1]) so that you pass the index of the last selected feature?
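
The distinction can be seen on a toy example. Assuming `S` is the list of selected column indices in selection order (as the quoted loop suggests), `len(S) - 1` is a position within `S`, which makes sense as a row index into `feature_mi_matrix`, while `S[-1]` is the column index in `X` of the most recently selected feature. The values below are made up for illustration.

```python
# Column indices of the features selected so far, in selection order
# (made-up values for illustration).
S = [97, 105, 2221]

row = len(S) - 1       # position of the latest selection within S
last_selected = S[-1]  # column index of the latest selection in X

print(row, last_selected)
```

The two values coincide only by accident, so using the first to index columns of `X` would pick an unrelated feature.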

Aside: I found implementations of MI for continuous and discrete features and outputs included in the newest development version (0.18.dev0) of sklearn (https://github.com/scikit-learn/scikit-learn/blob/0f2a00f/sklearn/feature_selection/mutual_info_.py#L290). Using mutual_info_regression, I don't get negative values for MI, so perhaps there are specific modifications that make it more stable.

Cheers,
Michael

Python 3 Compatibility: xrange in mi.py

Hi @danielhomola,
Congrats on passing your PhD viva .. well done!
Many thanks for putting this together, it's really useful
I have tried to use mifs using python 3.5 and it complained that it does not recognise xrange in mi.py.
https://github.com/danielhomola/mifs/blob/master/mifs/mi.py#L56

I have resolved this by defining xrange in mi.py as:

def xrange(*args, **kwargs):
    return iter(range(*args, **kwargs))

This was based on this answer:
https://stackoverflow.com/a/34950015

It also complained about the parentheses for print

Best wishes,
Noureddin
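
An alternative to wrapping `range` in a function is to alias the name only when it is missing, which keeps the same source working on both Python 2 and Python 3. This is a common compatibility idiom, shown here as a sketch rather than as the change made in the repository.

```python
try:
    xrange  # Python 2: the builtin exists, nothing to do
except NameError:
    xrange = range  # Python 3: range is already lazy

total = sum(i for i in xrange(5))
```

The `print` complaint can be handled in a similar spirit by adding `from __future__ import print_function` at the top of the module and calling `print(...)` as a function.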

ValueError: numpy.nanargmax raises on a.size==0 and axis=None; So Bottleneck too.

Any attempt to call fit() raises an error ValueError: numpy.nanargmax raises on a.size==0 and axis=None; So Bottleneck too.

import numpy as np
import mifs

X = np.random.random(size=100).reshape((25,4))*100
y = np.random.random(size=25)*100

print(X.shape, y.shape)
sel = mifs.MutualInformationFeatureSelector(method='JMI', categorical=False)
sel.fit(X, y)

(25, 4) (25,)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-87-47bba24c3c23> in <module>()
  7 print(X.shape, y.shape)
  8 sel = mifs.MutualInformationFeatureSelector(method='JMI', categorical=False)
----> 9 sel.fit(X, y)

c:\users\a7slha\code\mifs\mifs\mifs.py in fit(self, X, y)
147             self.n_jobs = NUM_CORES - self.n_jobs
148 
--> 149         return self._fit(X, y)
150 
151 

c:\users\a7slha\code\mifs\mifs\mifs.py in _fit(self, X, y)
242             fmm = feature_mi_matrix[:len(S), F]
243             if self.method == 'JMI':
--> 244                 selected = F[bn.nanargmax(bn.nansum(fmm, axis=0))]
245             elif self.method == 'JMIM':
246                 if bn.allnan(bn.nanmin(fmm, axis = 0)):

ValueError: numpy.nanargmax raises on a.size==0 and axis=None; So Bottleneck too.
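
The message in the traceback matches what happens when an arg-max is requested over an empty array. The sketch below reproduces that with plain NumPy and shows one possible guard. It assumes the candidate set is empty by the time the JMI step runs, which the report does not confirm, and `pick_best` is a hypothetical helper, not part of mifs.

```python
import numpy as np


def pick_best(scores, candidates):
    """Return the candidate with the highest non-NaN score, or None.

    Hypothetical helper: scores and candidates are parallel sequences.
    """
    scores = np.asarray(scores, dtype=float)
    # nanargmax raises ValueError on empty or all-NaN input, so check first.
    if scores.size == 0 or np.all(np.isnan(scores)):
        return None
    return candidates[int(np.nanargmax(scores))]


# An empty array triggers a ValueError in plain NumPy as well.
try:
    np.nanargmax(np.array([]))
    raised = False
except ValueError:
    raised = True

print(raised)
print(pick_best([], []))
print(pick_best([0.1, np.nan, 0.4], [3, 5, 8]))
```

Returning `None` lets the caller stop selecting features cleanly instead of failing inside the arg-max call.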

Negative Entropy Values and More

Hello. Thanks for this package, but I am running into a lot of trouble with it.

First of all, mi.py uses the entropy implementation by Gael Varoquaux, which gives negative MI values. I replaced it with sklearn's MI and got rid of that problem, but the features that end up being chosen still don't make sense.

I used the iris dataset from sklearn and duplicated one feature. As you can see, the method ends up picking the same feature twice, which shouldn't be the case. Here is the MWE:

import numpy as np
import pandas as pd
import mifs
from sklearn import datasets

iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

X = iris_df[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)']].values
print(X[:5, :])
# prepend a copy of petal length so the same feature appears twice
X = np.hstack((X[:, 2].reshape((-1, 1)), X))
print(X[:5, :])
y = iris_df['petal width (cm)'].values

# define MI_FS feature selection method
feat_selector = mifs.MutualInformationFeatureSelector(categorical=False, n_features=2)

# find all relevant features
feat_selector.fit(X, y)

# check selected features
print (feat_selector._support_mask)

# check ranking of features
print (feat_selector.ranking_)

# call transform() on X to filter it down to selected features
X_filtered = feat_selector.transform(X)

You can comment out or uncomment the line that prepends the duplicated column (the np.hstack call).

Also, feat_selector has no attribute called support, so I had to replace it with _support_mask in your example. The only code I changed was the function _get_first_mi, which now reads:

from sklearn.feature_selection import mutual_info_regression


def _get_first_mi(i, k, MI_FS):
    n, p = MI_FS.X.shape
    x = MI_FS.X[:, i].reshape((n, 1))

    if MI_FS.categorical:
        MI = _mi_dc(x, MI_FS.y, k)
    else:
        # use sklearn's estimator in place of _mi_cc, which returned negative values
        MI = mutual_info_regression(x, MI_FS.y, n_neighbors=k)[0]

    # MI must be non-negative
    if MI > 0:
        return MI
    else:
        return np.nan

transform function for feature selector

Hi sir! I am currently working on my thesis and want to use your library with the mRMR method on my DataFrame. I run the method and get the features that satisfy the condition. However, when I call fit_transform, I get different features from those chosen by the mRMR method; I noticed this when I checked the data in my DataFrame. I tried to find where the transform method of the feature selector is defined, but I couldn't. Can you help me, please?
Thank you in advance!
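
One way to check which DataFrame columns a fitted selector kept is to apply its boolean mask to the column names. The sketch below uses a hard-coded mask and made-up column names. With mifs the mask would presumably come from the selector (another issue on this page mentions a `_support_mask` attribute), so treat that attribute name as an assumption.

```python
from itertools import compress

# Hypothetical column names and selection mask, for illustration only.
columns = ["age", "height", "weight", "income"]
mask = [True, False, False, True]

# Keep only the names whose mask entry is True.
selected_columns = list(compress(columns, mask))
print(selected_columns)
```

Comparing these names with the columns of the transformed array shows whether the selection and the transform agree.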
