gaa-uam / scikit-fda

Functional Data Analysis Python package

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

alignment classification clustering curves dimensionality-reduction functional-data-analysis functions machine-learning python python3 registration regression scikits smoothing statistics visualization

scikit-fda's Issues

K-means tolerance

This tolerance is not the same as the one Sklearn uses. Sklearn uses an L2 norm, while you are using something similar to an L-infinity norm.

Originally posted by @vnmabus in #93
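
To make the difference concrete, here is a minimal sketch in plain Python. The function names and the toy centroids are illustrative and are not taken from scikit-fda or sklearn, and any rescaling of the tolerance that sklearn applies is not reproduced here.

```python
import math


def l2_shift(old_centroids, new_centroids):
    """L2 (Euclidean) norm of the displacement of all centroids."""
    return math.sqrt(sum(
        (a - b) ** 2
        for old, new in zip(old_centroids, new_centroids)
        for a, b in zip(old, new)
    ))


def linf_shift(old_centroids, new_centroids):
    """L-infinity norm: the largest absolute change of any single value."""
    return max(
        abs(a - b)
        for old, new in zip(old_centroids, new_centroids)
        for a, b in zip(old, new)
    )


# Two centroids discretized at three points; every value moves by 0.01.
old = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
new = [[0.01, 0.01, 0.01], [1.01, 1.01, 1.01]]
tol = 0.02

print(linf_shift(old, new) < tol)  # the L-infinity criterion is already met
print(l2_shift(old, new) < tol)    # the L2 criterion is not met yet
```

With the same tolerance, the L-infinity criterion stops earlier than the L2 one, because the L2 norm accumulates the change over all points and all centroids.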

Problem with pandas methods

After updating the neighbors branch (#112), I get an error on this line:

scikit-fda/examples/plot_k_neighbors_classification.py

Line 61 in 91d96ec

 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0) 

Traceback (most recent call last):
  File "plot_k_neighbors_classification.py", line 63, in <module>
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
  File "/Users/pablomm/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 2124, in train_test_split
    safe_indexing(a, test)) for a in arrays))
  File "/Users/pablomm/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 2124, in <genexpr>
    safe_indexing(a, test)) for a in arrays))
  File "/Users/pablomm/anaconda3/lib/python3.6/site-packages/sklearn/utils/__init__.py", line 219, in safe_indexing
    return X.take(indices, axis=0)
TypeError: take() got an unexpected keyword argument 'axis'

It seems that the sklearn API detects that FData has a take method and uses it to slice the samples, passing an axis argument that the method does not accept.

In fact, the take method also does not work properly without the axis parameter.

>>> import skfda
>>> a = skfda.datasets.make_sinusoidal_process()
>>> a.take([1,2,3])

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/pablomm/anaconda3/lib/python3.6/site-packages/scikit_fda-0.2.3-py3.6-macosx-10.7-x86_64.egg/skfda/representation/_functional_data.py", line 1288, in take
    return self._from_sequence(result, dtype=self.dtype)
  File "/Users/pablomm/anaconda3/lib/python3.6/site-packages/scikit_fda-0.2.3-py3.6-macosx-10.7-x86_64.egg/skfda/representation/_functional_data.py", line 1224, in _from_sequence
    return cls(scalars, dtype=dtype)
TypeError: __init__() got an unexpected keyword argument 'dtype'
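
A possible shape for a take method that tolerates the axis=0 call made by sklearn is sketched below. SampleContainer is a hypothetical stand-in, not the FData class, and the allow_fill / fill_value handling is my reading of the pandas extension-array convention, so treat it as an assumption.

```python
class SampleContainer:
    """Hypothetical stand-in for a container of functional observations."""

    def __init__(self, samples):
        self.samples = list(samples)

    def __len__(self):
        return len(self.samples)

    def take(self, indices, allow_fill=False, fill_value=None, axis=0):
        """Select observations by position along the sample axis."""
        if axis != 0:
            raise ValueError("only axis=0 (the sample axis) is supported")
        if allow_fill:
            # Assumed pandas convention: -1 marks a missing position,
            # any other negative index is an error.
            if any(i < -1 for i in indices):
                raise ValueError("negative indices other than -1 are invalid")
            taken = [fill_value if i == -1 else self.samples[i]
                     for i in indices]
        else:
            # Without filling, negative indices count from the end.
            taken = [self.samples[i] for i in indices]
        return type(self)(taken)


container = SampleContainer(["f0", "f1", "f2", "f3"])
subset = container.take([1, 2, 3], axis=0)  # the call sklearn makes
print(subset.samples)
```

The result is built through the ordinary constructor, so no dtype keyword reaches __init__, which is what fails in the second traceback.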

Matplotlib deprecation warning - mpldatacursor

After updating matplotlib, I get the following warning when running the example plot_clustering.py.

/Users/pablomm/anaconda3/lib/python3.6/site-packages/mpldatacursor-0.6.2-py3.6.egg/mpldatacursor/convenience.py:160: MatplotlibDeprecationWarning:
The iterable function was deprecated in Matplotlib 3.1 and will be removed in 3.3. Use np.iterable instead.
if not cbook.iterable(axes):

It is due to the mpldatacursor package. There is an open issue in its repository, but it seems that the package is no longer actively maintained (the last commit is from 2016).

Perhaps we should switch to another package, such as mplcursors, which offers the same functionality and is actively maintained.

Version information

OS: macOS
Python version: 3.6.5
Matplotlib: 3.1.1
Mpldatacursor: 0.6.2
scikit-fda version: develop

Random seed in tests

The random state of the tests of magnitude shape plot is not being fixed, which makes the tests fail.

https://github.com/GAA-UAM/fda/blob/28a6e0b58053c29dd4fca0624d85ee44e00bbf19/fda/magnitude_shape_plot.py#L378

In the surface boxplot example, the random seed used to generate the datasets is not fixed either.
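
The general idea can be sketched with the standard library. The helper make_noisy_samples and its random_state parameter are hypothetical, and the real tests would presumably fix a numpy random state instead.

```python
import random


def make_noisy_samples(n_samples, n_points, random_state=None):
    """Generate Gaussian noise samples from a private, seedable generator."""
    rng = random.Random(random_state)  # no global state is touched
    return [
        [rng.gauss(0.0, 1.0) for _ in range(n_points)]
        for _ in range(n_samples)
    ]


# The same seed gives exactly the same data, so a test that compares
# against hard-coded expected values becomes deterministic.
first = make_noisy_samples(3, 5, random_state=0)
second = make_noisy_samples(3, 5, random_state=0)
print(first == second)
```

Passing the seed explicitly, instead of seeding a global generator, also keeps the tests independent of the order in which they run.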

Problem in fpca

The second parameter of the function fpca is not used.

https://github.com/GAA-UAM/fda/blob/0b13e96fbf5012dca863ea48a5559cc24d3fbd31/fda/math.py#L408-L432

Maybe the default value of n should return all the principal components.

Typo in documentation

There is a small typo in the reference to the FDA book, both in the README and in the documentation in the source code.

Ramsay, J., Silverman, B. W. (2005). Functional Data Analysis. ~~Springler~~ Springer.

Retrieve coefficients for function reconstruction

Hey there!

I fitted a KNN FDataGrid to my input data. It actually looks pretty good so far, and I would now like to "export" it so I can represent the function as numerical values (preferably in a numpy array).

I saw that you offer some bases that can be used to "export" the underlying representation. Could you elaborate on which basis should be used when?
My data represents a demand/supply curve. I tried the BSpline one, but it only constructs something close to a sine wave, which doesn't really represent my data.

Here is an image of the graph itself:

Is there some way to get the raw representation instead of transforming it to another basis?

Linear differential operator is not being built properly.

>>> from skfda.misc import LinearDifferentialOperator as Ldf
>>> Ldf(weights=[3, 4, 5])
LinearDifferentialOperator(
    nderiv=2,
    bwtlist=[
    FDataBasis(
        basis=Constant(domain_range=[array([0, 1])], nbasis=1),
        coefficients=[[3]],
        dataset_label=None,
        axes_labels=None,
        extrapolation=None,
        keepdims=False),
    FDataBasis(
        basis=Constant(domain_range=[array([0, 1])], nbasis=1),
        coefficients=[[4]],
        dataset_label=None,
        axes_labels=None,
        extrapolation=None,
        keepdims=False)]

Originally posted by @pablomm in #139
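
For reference, this is what I would expect from three weights, assuming they are ordered from the lowest to the highest derivative order, so that weights=[3, 4, 5] means L = 3 + 4 D + 5 D^2. The sketch below uses plain polynomial coefficient lists and hypothetical helper names; it does not use the skfda classes.

```python
def derivative(coeffs):
    """Derivative of a polynomial given as [c0, c1, c2, ...] (ascending powers)."""
    return [k * c for k, c in enumerate(coeffs)][1:] or [0]


def apply_operator(weights, coeffs):
    """Apply L = sum_k weights[k] * D^k to a polynomial."""
    result = [0] * len(coeffs)
    current = list(coeffs)
    for weight in weights:
        # Add weight * (k-th derivative) to the accumulated result.
        for i, c in enumerate(current):
            result[i] += weight * c
        current = derivative(current)
    return result


# f(x) = x^2, so L f = 3 x^2 + 4 (2 x) + 5 (2) = 10 + 8 x + 3 x^2.
print(apply_operator([3, 4, 5], [0, 0, 1]))
```

All three weights contribute to the result, whereas the repr above only keeps the first two.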

Missing examples

I open this issue to list missing examples that should be in the documentation:

A good FDataBasis tutorial, presenting each basis and comparing between them.
UEA multivariate datasets
A numpy ufuncs + FDataGrid example
Feature selection examples
Examples on Pandas integration

ImportError running test cases

Hi developer,

Sorry if my question is more about seeking help than contributing to the package development. I am trying to run your fda package locally, but I could not build it successfully. I run into
ImportError: dynamic module does not define module export function (PyInit_optimum_reparam_extension)
which is raised by the line
import optimum_reparam_extension
I hope this package provides functionality similar to the fda package in R. I am brand new to GitHub, so all I did after downloading the source code with git was install all the requirements in requirements.txt and build the project by running "python3 setup.py" with the commands "build", "install", "bdist" and "check". In case it helps, I am using Python 3.6.4 on macOS Mojave 10.14. It would be great to get your help finding out what might be going wrong. I have no idea whether this "optimum_reparam_extension" file should already be in your package, or whether it is something generated when I build it myself. Many thanks!

Rename interpolator to interpolation, or extrapolation to extrapolator

The naming should be consistent between these properties.

Allow labeling of functional observations

A user has told us about the need to preserve labels for the observations. We should consider the possibility of adding them or, probably even better, allow an xarray to be used internally.

Problem with colors when they are RGB tuples

The following example crashes because a numpy array is used to store the colors, a change added in #66:
https://github.com/GAA-UAM/fda/blob/940caa755e40cf26395e22a772806cba3ff18fd7/examples/plot_pairwise_alignment.py#L126-L133

ndim_codomain must have the @property decorator

https://github.com/GAA-UAM/fda/blob/f96534468bc585157c6b5294fc768f02589eeb2b/fda/functional_data.py#L104

mpldatacursor

Is this using global state? Because you are not passing any parameter to link this with the figure.

Originally posted by @vnmabus in #93

Local Outlier Factor for functional data

The LocalOutlierFactor estimator was added in the private module skfda._neighbors.outlier.LocalOutlierFactor, without references in the documentation, until the theoretical aspects of this method are clarified in the context of FDA (see #164).

The example written for this class can be found at https://gist.github.com/pablomm/eb93c469473ea76baed7e3e72578de68

Deepcopy of FDataBasis object

Shouldn't it make a copy of the coefficients too?

https://github.com/GAA-UAM/fda/blob/92c984cb42a505661716c26f4cd56a2f1e69e26f/fda/basis.py#L1762-L1763
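
A minimal sketch of the behavior I would expect, using a hypothetical BasisExpansion class rather than the real FDataBasis:

```python
import copy


class BasisExpansion:
    """Hypothetical stand-in for an object holding a basis and coefficients."""

    def __init__(self, basis, coefficients):
        self.basis = basis
        self.coefficients = coefficients

    def __deepcopy__(self, memo):
        # Copy both the basis and the coefficients, so that the clone
        # shares no mutable state with the original.
        return type(self)(
            copy.deepcopy(self.basis, memo),
            copy.deepcopy(self.coefficients, memo),
        )


original = BasisExpansion({"nbasis": 3}, [[1.0, 2.0, 3.0]])
clone = copy.deepcopy(original)
clone.coefficients[0][0] = 99.0

print(original.coefficients[0][0])  # the original is left untouched
```

If only the basis were copied, the assignment on the clone would also modify the coefficients of the original.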

Numpy Future Warning

Since numpy 1.16 a warning is raised in basis.py:

/home/travis/build/GAA-UAM/fda/fda/basis.py:917:
FutureWarning: arrays to stack must be passed as a "sequence" type such as list or tuple. Support
for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an
error in the future.
self.knots[:-1]))

Unify scalar and functional regressors

It was mentioned to me in a meeting that it makes no sense to differentiate between scalar and functional regressors. First, because they are not scalar but multivariate, and second, because there is no confusion: the output of predict depends on the input to fit.

I forgot to mention this in the review of #112, but it should not be difficult to change. Can you do it, @pablomm?

Rename nsamples to n_samples

FData class

In the branch feature/fdata I created the class FData to unify the API of FDataBasis and FDataGrid. It is just a suggestion of a possible API. I would like to discuss which methods are most appropriate for the common class, as well as the interface of those methods.

In particular, the API of evaluate and the keepdims parameter.

https://github.com/GAA-UAM/fda/blob/859bb4055bd9a325fb1321ddb6b4d761013489ad/fda/fdata.py#L47-L50

FDataGrid indexing

The current __getitem__ implementation does not support indexing properly.

Fourier period

I found in the original fda R code that the period of the Fourier basis is diff(domain_range) by default, but we set it to 1 by default. Is this an issue or an improvement?

Domain range on bspline creation

If I create a BSpline in this way

BSpline(nbasis=6, order=4, knots=[0, 0.3, 0.3, 1])

It throws an error, but I think it should take the domain_range from the first and last knots.

https://github.com/GAA-UAM/fda/blob/9893dccf16a5742399dfa5c9af0ffdc32697b668/fda/basis.py#L549-L566
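
The proposed behavior could look roughly like this. Both helpers are hypothetical, and the relation nbasis = len(knots) + order - 2 is my assumption about how the knots (including the two boundary knots) relate to the number of basis functions.

```python
def infer_domain_range(knots, domain_range=None):
    """Use the first and last knots when no domain_range is given."""
    first, last = knots[0], knots[-1]
    if domain_range is None:
        return (first, last)
    if tuple(domain_range) != (first, last):
        raise ValueError(
            "the first and last knots must match the domain_range"
        )
    return tuple(domain_range)


def expected_nbasis(knots, order):
    """Assumed relation: interior knots (len(knots) - 2) plus the order."""
    return len(knots) + order - 2


knots = [0, 0.3, 0.3, 1]
print(infer_domain_range(knots))        # (0, 1)
print(expected_nbasis(knots, order=4))  # 6
```

With these values the example from the issue is consistent: four knots and order 4 give nbasis=6, and the domain would be [0, 1].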

Class documentation should show methods

Make class documentation show the complete documentation for each method, as in Scikit-learn.

Registration and interpolation to-do's

To-do list not covered in #9:

Registration

Tests of registration.py
Doctests of registration.py
Review documentation of registration.py
Add sample generators to make examples of registration.py
Examples of registration.py

Interpolation

Write documentation of grid_interpolation.py
Review documentation of grid_interpolation.py
Tests of interpolation
Examples of interpolation
Inherit from ABC

Extrapolation

Consider more appropriate names for the extrapolation types
Think how to unify the extrapolation in the evaluate methods of FDataBasis and FDataGrid
Decide if the default extrapolation of each type of basis should be a class attribute or an attribute of the instance

Error compiling pdf docs

The pdf documentation fails. We should fix it.

Warning fetching tecator

A warning is generated when fetch_tecator is called.

fda.datasets.fetch_tecator()
Using TensorFlow backend.
/Users/pablomm/anaconda3/lib/python3.6/site-packages/rdata-0.2.1-py3.6.egg/rdata/conversion/_conversion.py:197: UserWarning: Unknown encoding. Assumed ASCII.
warnings.warn(f"Unknown encoding. Assumed ASCII.")

Error while importing skfda

The installation goes through, but on importing skfda I run into the error ImportError: cannot import name 'OutlierMixin'

Remove `shape` property of FDataGrid

ndim is going to change to return 1, for Pandas compatibility. Thus, it will no longer coincide with the number of dimensions of shape, which are those of the data matrix. As the shape can be obtained directly from the matrix, I propose removing this property.

Travis doctest

After #96, Travis is not running the doctests (or they are not being shown in the Travis log).

First plot messes figure and subplots

The first plot of a multivariate functional object is messed up. Subsequent plots are ok.

Add matplotlib as a dependency

Matplotlib is currently imported, but it is not marked as a dependency in setup.py

Plot function on Basis

This function checks for ndim_domain and ndim_image, but these are attributes of FData objects, not of Basis objects.

https://github.com/GAA-UAM/fda/blob/92c984cb42a505661716c26f4cd56a2f1e69e26f/fda/basis.py#L138-L155

Add Cython as a dependency

Cython should be declared as a dependency. See here for how to do that.

Lp-distance matrix

Description
The documentation for skfda.misc.metrics.lp_distance says that the function calculates the distance between all possible pairs of samples from two FDataGrid objects. At the moment, it only calculates the distance between the n-th sample of the first FDataGrid and the n-th sample of the second one.

Possible solution
Let fd1 be the first FDataGrid with samples [f_11, ..., f_1n] and fd2 the second FDataGrid with samples [f_21, ..., f_2m]. The function should return an n x m matrix whose component (i, j) is d(f_1i, f_2j).

References
The functionality described here follows that of metric.lp in fda.usc.
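
A rough sketch of the proposed behavior in plain Python, with samples given as lists of values on a common grid. The function names are hypothetical and the integral is approximated with the trapezoidal rule, which may differ from what the library does.

```python
def lp_distance(f, g, grid, p=2):
    """Lp distance between two functions sampled on the same grid."""
    diffs = [abs(a - b) ** p for a, b in zip(f, g)]
    # Trapezoidal rule for the integral of |f - g|^p.
    integral = sum(
        (grid[i + 1] - grid[i]) * (diffs[i] + diffs[i + 1]) / 2
        for i in range(len(grid) - 1)
    )
    return integral ** (1 / p)


def pairwise_lp_distance(fd1, fd2, grid, p=2):
    """n x m matrix whose component (i, j) is d(fd1[i], fd2[j])."""
    return [[lp_distance(f, g, grid, p) for g in fd2] for f in fd1]


grid = [0.0, 0.5, 1.0]
fd1 = [[0, 0, 0], [1, 1, 1]]             # n = 2 constant functions
fd2 = [[0, 0, 0], [1, 1, 1], [2, 2, 2]]  # m = 3 constant functions

for row in pairwise_lp_distance(fd1, fd2, grid):
    print(row)
```

The two inputs may contain a different number of samples; the current sample-by-sample behavior would correspond to reading only the diagonal of this matrix.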

Missing methods and functions from fda

I open this issue to discuss missing functionality from the fda package. I will only put the missing parts here. The wiki has the full comparison.

General functionality

Datasets

Continuously Stirred Tank Reactor (CSTR) Ordinary Differential Equations (ODEs).
Obtain data from the Human Mortality Database.

Plotting

Plot cycles.
Principal differential analysis plots (Stability Analysis).
Phase-plane plot.
Plot Functional Canonical Correlation Weight Functions.
Plot PCA.
Plot functional parameter objects with confidence limits.
Plot real data + Functional data.
Plot the results of the registration of a set of curves.
Plot Principal Component Scores.

Dimensionality reduction

PCA.
Principal differential analysis.

Registering

Regression

Functional linear regression with scalar response.
Functional concurrent linear regression with functional response.
Fully functional linear regression with functional response.
Winsorized regression.

Surface plots do not intersect

Surface plots are plotted one on top of another. Thus, they do not intersect where they should.

FDataBasis as list

The to_list method of FDataBasis is no longer needed after #68.

scikit-fda/skfda/basis.py

Line 1962 in 0a6205e

def to_list(self):

Now the object can be unpacked as a list using the list constructor.

a = FDataBasis(...)
list(a)
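
The reason this works can be illustrated with a hypothetical container (not the real FDataBasis): once __len__ and __getitem__ are defined, the list constructor can iterate over the object by itself.

```python
class FunctionalSamples:
    """Hypothetical container; each row holds one observation's coefficients."""

    def __init__(self, coefficients):
        self.coefficients = [list(row) for row in coefficients]

    def __len__(self):
        return len(self.coefficients)

    def __getitem__(self, key):
        if isinstance(key, slice):
            return type(self)(self.coefficients[key])
        # An integer key returns a container with a single observation.
        # An out-of-range key raises IndexError, which ends iteration.
        return type(self)([self.coefficients[key]])


fd = FunctionalSamples([[1, 2], [3, 4], [5, 6]])
pieces = list(fd)  # one single-observation container per sample

print(len(pieces))
print(pieces[1].coefficients)
```

Python falls back to calling __getitem__ with 0, 1, 2, ... when no __iter__ is defined, so a separate to_list method adds nothing.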

Enhancement in sample_labels

The current sample_labels argument of plot, used to group samples by color, is very restrictive: it only accepts labels of the form 0, 1, 2, ..., n_classes - 1, without skipping any number.

This can be fixed by using LabelEncoder internally.

To Reproduce
Code to reproduce the behavior:

from skfda.datasets import make_sinusoidal_process
fd = make_sinusoidal_process(n_samples=5)

fd.plot(sample_labels=[0, 0, 2, 2, 2])  # Not valid
fd.plot(sample_labels=[-1, -1, 0, 1, 1]) # Not valid

Result

ValueError: sample_labels must contain at least an occurence of numbers between 0 and number of distint sample labels.
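
What the encoding step would do can be sketched without sklearn. The helper encode_labels is hypothetical; it mimics what I understand LabelEncoder to do, which is to map the sorted distinct labels to 0, 1, 2, ...

```python
def encode_labels(labels):
    """Map arbitrary labels to consecutive integers starting at 0."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes


# The two calls rejected above become valid after encoding.
print(encode_labels([0, 0, 2, 2, 2]))
print(encode_labels([-1, -1, 0, 1, 1]))
```

The returned list of classes can be kept to build a legend that shows the original labels.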

Bug in plot function

When an FData object with a one-dimensional domain is plotted after another plot, a matplotlib deprecation warning is raised, but the result is correct.

>>> import matplotlib.pyplot as plt; import fda
>>> a = fda.datasets.make_multimodal_samples(ndim_domain=1)
>>> a.plot()
(<Figure size 640x480 with 1 Axes>, [<matplotlib.axes._subplots.AxesSubplot object at 0x1c1133a3c8>])
>>> a.plot()
/Users/pablomm/anaconda3/lib/python3.6/site-packages/matplotlib/cbook/deprecation.py:107: MatplotlibDeprecationWarning: Adding an axes using the same arguments as a previous axes currently reuses the earlier instance.  In a future version, a new instance will always be created and returned.  Meanwhile, this warning can be suppressed, and the future behavior ensured, by passing a unique label to each axes instance.
  warnings.warn(message, mplDeprecation, stacklevel=1)
(<Figure size 640x480 with 1 Axes>, [<matplotlib.axes._subplots.AxesSubplot object at 0x1c1133a3c8>])
>>> plt.plot()

In the case of surfaces, the result is an empty drawing, because the second call creates a second axes on top of the previous one but draws on the first one.

>>> a = fda.datasets.make_multimodal_samples(ndim_domain=2)
>>> a.plot()
(<Figure size 640x480 with 1 Axes>, [<matplotlib.axes._subplots.Axes3DSubplot object at 0x1c22094400>])
>>> a.plot()
(<Figure size 640x480 with 2 Axes>, [<matplotlib.axes._subplots.Axes3DSubplot object at 0x1c22094400>, <matplotlib.axes._subplots.Axes3DSubplot object at 0x1c22106400>])
>>> plt.show()

Pandas integration

Allow functional data objects to be used as Pandas columns. This is useful for treating functional data as an atomic unit, while allowing it to be mixed with univariate/multivariate data in the datasets.

Similar functionality has been implemented in tidyfun for R.

Bug in basis covariance

The FDataBasis method cov returns the variance instead of the covariance.

https://github.com/GAA-UAM/fda/blob/51abda6d1b42ff71b317f661de5dae29e8777a7a/fda/basis.py#L1593-L1611
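
To illustrate the difference: for functions written in a basis phi, the covariance is cov(s, t) = phi(s)^T Sigma phi(t), where Sigma is the covariance matrix of the coefficients, and the variance is only the special case s = t. The sketch below uses a monomial basis and hypothetical function names; it is not the library implementation.

```python
def coefficient_covariance(coefficients):
    """Sample covariance matrix (ddof=1) of the coefficient vectors."""
    n = len(coefficients)
    k = len(coefficients[0])
    means = [sum(row[j] for row in coefficients) / n for j in range(k)]
    return [
        [
            sum((row[i] - means[i]) * (row[j] - means[j])
                for row in coefficients) / (n - 1)
            for j in range(k)
        ]
        for i in range(k)
    ]


def monomial_basis(x, n_basis):
    """Values of the basis functions 1, x, x^2, ... at the point x."""
    return [x ** j for j in range(n_basis)]


def covariance(s, t, coefficients):
    """cov(s, t) = phi(s)^T Sigma phi(t)."""
    sigma = coefficient_covariance(coefficients)
    phi_s = monomial_basis(s, len(sigma))
    phi_t = monomial_basis(t, len(sigma))
    return sum(
        phi_s[i] * sigma[i][j] * phi_t[j]
        for i in range(len(sigma))
        for j in range(len(sigma))
    )


# Three samples f(x) = c0 + c1 * x: the functions 0, 1 + 2x and 2 + 4x.
coefficients = [[0, 0], [1, 2], [2, 4]]
print(covariance(0, 1, coefficients))  # covariance between two points
print(covariance(1, 1, coefficients))  # variance at x = 1
```

A cov method that depends on a single evaluation point can only return the second kind of value.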

Smoothing in several dimensions

The current smoothers only work for one-dimensional functions. It should be reasonably easy to extend them to several dimensions.

Magic constants

Magic constants are used in the discretisation of FDataBasis objects, for instance in FDataBasis.to_grid or in the registration and regression methods. We should think about how to manage them, with an enum or something similar.

https://github.com/GAA-UAM/fda/blob/0b13e96fbf5012dca863ea48a5559cc24d3fbd31/fda/basis.py#L1411

https://github.com/GAA-UAM/fda/blob/62c5f518ab2e13b9a4cb9903ecf69f61dcf798cd/fda/basis.py#L1459

Fourier basis in representation example

scikit-fda/examples/plot_representation.py

Lines 97 to 104 in ee8b316

# We can also see the effect of changing the basis.
# For example, in the Fourier basis the functions start and end at the same
# points, so this basis is clearly non suitable for the Growth dataset.
fd_basis = fd.to_basis(
    basis.Fourier(domain_range=fd.domain_range[0], nbasis=7)
)

fd_basis.plot()

I don't know if this sentence is completely correct. The functions will start and end at the same point if the period of the basis equals the length of the domain range, but we can set the period to 2*|domain_range| to avoid this problem.

period = 2 * (fd.domain_range[0][1] - fd.domain_range[0][0])
fd_basis = fd.to_basis(
    basis.Fourier(domain_range=fd.domain_range[0], nbasis=7, period=period)
)
fd_basis.plot()

Result:
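
The claim about the endpoints can also be checked numerically without the library. The sketch below evaluates unnormalized Fourier terms with the math module; the function name and the domain [1, 18] are assumptions made for the illustration.

```python
import math


def fourier_terms(x, period, n_pairs=3):
    """Unnormalized Fourier terms [1, sin, cos, sin, cos, ...] at x."""
    omega = 2 * math.pi / period
    terms = [1.0]
    for k in range(1, n_pairs + 1):
        terms.append(math.sin(k * omega * x))
        terms.append(math.cos(k * omega * x))
    return terms


def same_at_endpoints(start, end, period):
    """True if every basis term takes the same value at both endpoints."""
    return all(
        math.isclose(a, b, abs_tol=1e-9)
        for a, b in zip(fourier_terms(start, period),
                        fourier_terms(end, period))
    )


start, end = 1.0, 18.0  # illustrative domain
length = end - start

print(same_at_endpoints(start, end, period=length))      # period = |domain|
print(same_at_endpoints(start, end, period=2 * length))  # period = 2|domain|
```

When the period equals the domain length, every term, and therefore every linear combination, repeats at the endpoints. With the doubled period the odd harmonics change sign between the two ends, so the fitted curves are free to differ there.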

Documentation

I open this issue to unify all pending tasks (in my opinion) with respect to documentation.

FDataGrid.to_basis and basis range

Currently, the range of the basis used in FDataGrid.to_basis is assigned when the basis object is created, and defaults to the [0, 1] interval. To prevent confusion, it would be useful if the basis range, when not specifically set at creation time, were set inside the to_basis method to the domain range of the FDataGrid object.
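
One way to express this is to let an unset range be None and resolve it at conversion time. Both the Basis class and resolve_basis_range below are hypothetical names used only for this sketch.

```python
class Basis:
    """Hypothetical basis whose domain_range may be left unset (None)."""

    def __init__(self, n_basis, domain_range=None):
        self.n_basis = n_basis
        self.domain_range = domain_range


def resolve_basis_range(basis, grid_domain_range):
    """Return a basis whose range falls back to the range of the data."""
    if basis.domain_range is None:
        # Build a new object instead of mutating the one the user passed.
        return Basis(basis.n_basis, domain_range=grid_domain_range)
    return basis


unset = resolve_basis_range(Basis(7), (1, 18))
explicit = resolve_basis_range(Basis(7, domain_range=(0, 1)), (1, 18))

print(unset.domain_range)     # taken from the data
print(explicit.domain_range)  # the user's choice is kept
```

A range given explicitly by the user is never overridden; only the default changes.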

Windows build failure

I have an error related to the Windows build.

https://travis-ci.org/GAA-UAM/scikit-fda/jobs/550183221

Domain range on basis multiplication

The original fda R package raises a value error when you try to multiply two bases with different domain ranges.

Would it be possible to intersect the domain ranges to perform the multiplication?
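
The intersection itself is simple for one-dimensional ranges. The helper below is a hypothetical sketch of the idea, not a proposal for the final API.

```python
def intersect_domain_ranges(range_a, range_b):
    """Intersection of two intervals given as (lower, upper) pairs."""
    lower = max(range_a[0], range_b[0])
    upper = min(range_a[1], range_b[1])
    if lower >= upper:
        # Disjoint (or touching) ranges leave nothing to multiply on.
        raise ValueError("the domain ranges do not overlap")
    return (lower, upper)


print(intersect_domain_ranges((0, 1), (0.5, 2)))
```

An error is still needed when the ranges do not overlap, so the R behavior would remain as the fallback case.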
