
sdv-dev / copulas

A library to model multivariate data using copulas.

Home Page: https://sdv.dev/Copulas/

License: Other

Python 95.19% Makefile 2.17% R 1.46% MATLAB 1.18%
machine-learning copulas synthetic-data data-generation synthetic-data-generation generative-ai generative-model tabular-data

copulas's People

Contributors

aliciasun, amontanez24, csala, fealho, frances-h, gbonomib, jdtheripperpc, k15z, katxiao, kveerama, lajohn4747, manuelalvarezc, nazar-ivantsiv, paulolimac, pvk-developer, r-palazzo, rollervan, sdv-team

copulas's Issues

Add serialization to Vines

Add from_dict and to_dict methods to Vine Copulas: to_dict returns a dictionary with all of the instance's internal parameters, and from_dict creates a new instance from such a dictionary.
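
A minimal sketch of the requested round trip. ToyVine and its attributes (vine_type, n_var, tau_matrix) are hypothetical stand-ins chosen for illustration, not the library's actual VineCopula internals:

```python
class ToyVine:
    """Hypothetical stand-in for a vine copula; not the library's class."""

    def __init__(self, vine_type):
        self.vine_type = vine_type
        self.n_var = None
        self.tau_matrix = None

    def to_dict(self):
        # Export every internal parameter as plain Python types.
        return {
            'vine_type': self.vine_type,
            'n_var': self.n_var,
            'tau_matrix': self.tau_matrix,
        }

    @classmethod
    def from_dict(cls, params):
        # Rebuild an equivalent instance from the exported parameters.
        instance = cls(params['vine_type'])
        instance.n_var = params['n_var']
        instance.tau_matrix = params['tau_matrix']
        return instance


original = ToyVine('center')
original.n_var = 2
original.tau_matrix = [[1.0, 0.3], [0.3, 1.0]]
restored = ToyVine.from_dict(original.to_dict())
```

The useful invariant to test is that from_dict(to_dict()) yields an instance whose own to_dict() output is identical.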

Make vine copulas sample use num_rows arg.

Currently, the sample method of copulas.multivariate.vine.VineCopulas ignores the num_rows argument. It would be useful to either remove the argument or make it work.

Update VineCopula docstring

The VineCopula docstring is out of date: it says that vine_type should be one of ctype, rtype and dtype, when the actual accepted values are center, regular and direct.

Vine Copulas not working

Copulas version: 0.1
Python version: 3.6.6
Operating System: Fedora release 28 (Twenty Eight)

Description

I tried to use VineCopula with a simple dataset like the Breast Cancer dataset and got an error.

What I Did

from copulas.multivariate import VineCopula
import pandas as pd
from sklearn.datasets import load_breast_cancer


data = pd.DataFrame(load_breast_cancer()['data'])
c = VineCopula('center')
c.fit(data)

produced

/home/echo66/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/bivariate/frank.py:76: RuntimeWarning: divide by zero encountered in log
  return -1.0 / self.theta * np.log(1 + num / den)
/home/echo66/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/bivariate/clayton.py:60: RuntimeWarning: overflow encountered in power
  for i in range(len(U))
/home/echo66/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/bivariate/clayton.py:102: RuntimeWarning: overflow encountered in power
  B = np.power(V, -self.theta) + np.power(U, -self.theta) - 1
/home/echo66/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/bivariate/clayton.py:101: RuntimeWarning: overflow encountered in power
  A = np.power(V, -self.theta - 1)
/home/echo66/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/bivariate/clayton.py:104: RuntimeWarning: invalid value encountered in multiply
  return np.multiply(A, h) - y
/home/echo66/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/scipy/optimize/minpack.py:163: RuntimeWarning: The iteration is not making good progress, as measured by the 
  improvement from the last ten iterations.
  warnings.warn(msg, RuntimeWarning)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-56-ab3f90cc1dd4> in <module>
      3 c = VineCopula('center')
      4 
----> 5 c.fit(data)

~/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/multivariate/vine.py in fit(self, X, truncated)
     51             self.ppfs.append(uni.percent_point)
     52 
---> 53         self.train_vine(self.type)
     54 
     55     def train_vine(self, tree_type):

~/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/multivariate/vine.py in train_vine(self, tree_type)
     66             LOGGER.debug('start building tree: {0}'.format(k))
     67             tree_k = Tree(tree_type)
---> 68             tree_k.fit(k, self.n_var - k, tau, self.trees[k - 1])
     69             self.trees.append(tree_k)
     70             LOGGER.debug('finish building tree: {0}'.format(k))

~/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/multivariate/tree.py in fit(self, index, n_nodes, tau_matrix, previous_tree, edges)
     86                 self._build_kth_tree()
     87 
---> 88             self.prepare_next_tree()
     89 
     90     def _check_contraint(self, edge1, edge2):

~/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/multivariate/tree.py in prepare_next_tree(self)
    196 
    197             copula = Bivariate(edge.name)
--> 198             copula.fit(X_left_right)
    199             left_given_right = copula.partial_derivative(X_left_right, copula_theta)
    200             right_given_left = copula.partial_derivative(X_right_left, copula_theta)

~/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/bivariate/base.py in fit(self, X)
     87         self.tau = stats.kendalltau(U, V)[0]
     88         self.theta = self.compute_theta()
---> 89         self.check_theta()
     90 
     91     def to_dict(self):

~/.local/share/virtualenvs/SDV-phkC4KIc/lib/python3.6/site-packages/copulas/bivariate/base.py in check_theta(self)
    212         if (not lower <= self.theta <= upper) or (self.theta in self.invalid_thetas):
    213             message = 'The computed theta value {} is out of limits for the given {} copula.'
--> 214             raise ValueError(message.format(self.theta, self.copula_type.name))
    215 
    216     def check_fit(self):

ValueError: The computed theta value nan is out of limits for the given CLAYTON copula.

Remove need for data in Gaussian Copula Sample

The sample method of the Gaussian copula currently requires that the data attribute exists. This is not correct, since a user should be able to create a copula by just setting its parameters and still sample from it.

This happens on lines 91 and 92 of the Gaussian copula file.

python2 compatibility

  • Copulas version: 0.1.0
  • Python version: 2.6
  • Operating System:

Description

Please could you explicitly inherit from object when declaring your base classes 'BVCopula', 'MVCopula', 'UnivariateDistrib' and others I may have missed. This should allow compatibility with python2.
Many thanks!

What I Did

The error when using python2 is "TypeError: must be type, not classobj" whenever super is called.

I did some quick testing of parts of the codebase (not 100% coverage) to verify and explicitly inheriting from object does seem to be the only limiting factor preventing python2 compatibility.

fix numpy runtime warning

Description

Currently, the fit method raises warnings caused by a division by zero. In that case, theta should be set to infinity, and we should verify that the computation is still correct for get_pdf(), get_cdf(), etc.

Further Improvement

Maybe add a CopulaException class to ensure theta is in the valid range instead of checking inside each function.

Add tests for analytical properties of copulas.

Copulas, as mathematical functions, should fulfill some analytical properties:

  • The copula is zero if one of the arguments is zero.
  • The copula is equal to u if one argument is u and all others 1.

It would be nice to have one unit test for each property and copula in our test suite.
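
As a rough illustration of such a test, the sketch below checks both boundary properties for a bivariate Clayton copula written out from its textbook formula. clayton_cdf and check_boundary_properties are hypothetical helpers, not part of the library:

```python
import math


def clayton_cdf(u, v, theta=2.0):
    """Clayton copula C(u, v) = (u^-theta + v^-theta - 1)^(-1/theta), theta > 0."""
    if u == 0 or v == 0:
        # Limit value as an argument tends to zero; avoids 0 ** negative.
        return 0.0
    return (u ** -theta + v ** -theta - 1) ** (-1 / theta)


def check_boundary_properties(copula, grid):
    """Return True if C(u, 0) = C(0, u) = 0 and C(u, 1) = C(1, u) = u on the grid."""
    for u in grid:
        grounded = copula(u, 0.0) == 0.0 and copula(0.0, u) == 0.0
        uniform_margins = (
            math.isclose(copula(u, 1.0), u, abs_tol=1e-12)
            and math.isclose(copula(1.0, u), u, abs_tol=1e-12)
        )
        if not (grounded and uniform_margins):
            return False
    return True


grid = [i / 10 for i in range(11)]
all_hold = check_boundary_properties(clayton_cdf, grid)
```

The same checker can be reused for every copula family by passing a different function, which keeps one test per property and copula cheap to write.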

Make serialization of models flat.

Currently, the methods that serialize Copulas can return nested dictionaries, which are not useful to work with in some use cases. We can change to_dict so that it keeps the information about the internal structure in the keys, something like:

>>> copula.to_dict() # actual implementation
{
    'one_attribute': 0,
    'nested_attribute': {
        'foo': 'bar'
    }
}


>>> copula.to_dict() # Desired behavior
{
    'one_attribute': 0,
    'nested_attribute__foo': 'bar'
}
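
One possible way to get the flat form is a small recursive helper. flatten_dict is a hypothetical name, and the '__' separator is taken from the example keys in this issue:

```python
def flatten_dict(nested, prefix='', separator='__'):
    """Flatten nested dictionaries, joining key paths with `separator`."""
    flat = {}
    for key, value in nested.items():
        full_key = prefix + separator + key if prefix else key
        if isinstance(value, dict):
            # Recurse, carrying the path to this level in the prefix.
            flat.update(flatten_dict(value, full_key, separator))
        else:
            flat[full_key] = value
    return flat


nested = {'one_attribute': 0, 'nested_attribute': {'foo': 'bar'}}
flat = flatten_dict(nested)
```

This sketch assumes string keys that do not themselves contain the separator; otherwise the flat form could not be unambiguously converted back.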

Update Documentation

In order to make the project easy to use for new users, we should have:

  • An updated and complete README ( Showing examples of Vines, listing all the copula types, ...)
  • Docstrings on methods showing expected parameters and types, and usage examples.
  • A contribution guide

Integrate with CodeCov

Integrate with CodeCov to make sure all the changes from PR improve the code coverage of tests.

seed for the random numbers generators

  • Copulas version: 0.1
  • Python version: 3.6.6
  • Operating System: Fedora release 28 (Twenty Eight)

Description

I was expecting the sample methods to allow the user to pass a seed for the random numbers generators.

What I Did

Instead, we have to call numpy.random.seed(seed_value) and random.setstate(seed_value) outside the function call. This is bad practice from a software engineering standpoint and is very error prone because it affects the global state. It can also negatively impact experiment reproducibility and debugging.

Recommendations

Currently, in order to get the same sample from the sampling methods, we need to

  1. invoke np.random.seed(seed_value)
  2. invoke random.setstate(random_state_tuple)

outside of the sampling function being invoked (i.e. sample). This results in what is called, in software engineering, a leaky abstraction. In order to solve this issue with seed control, there are (at least) two approaches:

  1. Create a parameter, in the sample methods, named seed or random_state.
  2. Create a parameter in the constructor of classes offering the sample method named seed or random_state IF the distribution fit method requires some sort of stochastic process.

In scikit-learn and other popular python machine learning tools, what happens is the following

  1. When a model depends on some sort of stochastic process during the fit procedure, the model class constructor allows the user to set the random_state value. This value can be one of three things: None, an integer, or a numpy.random.RandomState instance. No matter what the value is, it will be checked and processed by sklearn.utils.check_random_state, which will output a numpy.random.RandomState instance. Note that sklearn.utils.check_random_state is invoked at the beginning of the fit method (check this example: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py#L129). If you set random_state to an integer, every fit has to be deterministic in its behavior and output.

  2. If, besides the fit method, there is another method that depends on stochastic processes (e.g. the sample method in sklearn.neighbors.KernelDensity), we are allowed to control the seed through the random_state parameter.

  3. In other more low level APIs like scipy, the seed must be an integer or None.

I also advise against using both the random and numpy.random modules at the same time because it makes the seed management harder.

EDIT: Current fix available at #62
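
For illustration, the first approach (a random_state parameter on sample) could look roughly like the sketch below. The check_random_state helper here is a simplified version written for this example and only loosely modeled on the scikit-learn utility; ToyGaussian is a hypothetical class, not one from the library:

```python
import numpy as np


def check_random_state(seed):
    """Turn None, an int, or a RandomState into a RandomState (assumed helper)."""
    if seed is None:
        return np.random.RandomState()
    if isinstance(seed, (int, np.integer)):
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        return seed
    raise ValueError('Cannot use {!r} to seed a RandomState'.format(seed))


class ToyGaussian:
    """Hypothetical distribution used only to illustrate the parameter."""

    def __init__(self, mean=0.0, std=1.0):
        self.mean = mean
        self.std = std

    def sample(self, num_samples, random_state=None):
        # A local generator keeps numpy's global random state untouched.
        rng = check_random_state(random_state)
        return rng.normal(self.mean, self.std, size=num_samples)


dist = ToyGaussian()
first = dist.sample(5, random_state=42)
second = dist.sample(5, random_state=42)
```

Passing the same integer twice yields the same sample without any call to numpy.random.seed, so other code that relies on the global generator is not affected.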

Error on vine copulas sampling - 'Edge' object has no attribute 'index'

  • Copulas version: 0.2.0
  • Python version: 3.6.5
  • Operating System: Xubuntu 18.04

Description

Got an exception while trying to sample data using Vine Copulas with a Regular Tree.

What I Did

import pandas as pd
from copulas.multivariate import VineCopula, TreeTypes

X = pd.DataFrame([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1]
])
vine = VineCopula(TreeTypes.REGULAR)
vine.fit(X)
vine.sample()

which gave me the following exception:

copulas/multivariate/vine.py in sample(self, num_rows)
    165                             if (edge.L == current and edge.R == visited[0]) or\
    166                                (edge.R == current and edge.L == visited[0]):
--> 167                                 current_ind = edge.index
    168                                 break
    169                         else:

AttributeError: 'Edge' object has no attribute 'index'

Add support to Python 3.7

Currently, we only support python 3.5 and 3.6. We need to add the newest version of python. To do so, we need to check:

  1. All dependencies of the package are compatible with 3.7
  2. The project builds after adding environment 3.7 on TravisCI
  3. The supported versions are correctly listed in setup.py

Enforce python naming conventions

Filenames must follow python naming conventions and shouldn’t be redundant (univariate/GaussianUnivariate.py -> univariate/gaussian.py).
Function and variable names shouldn’t be acronyms but explicit and clear names (Copulas.cdf -> Copulas.cumulative_distribution).

Add CLA

Configure a service to allow contributors to sign a CLA before submitting their contributions.

Integrate with TravisCI

Integrate with TravisCI in order to:

  • Run tests on each commit
  • Build documentation
  • Release to PyPI

Implement `partial_derivative_scalar` in Bivariate Base class

The method _partial_derivative is being used in the Vine classes:

https://github.com/DAI-Lab/Copulas/blob/57e4eb3a462e0ccffc25cc4bedd5a413304fe27a/copulas/multivariate/vine.py#L187

However, this method is not intended to be called from outside the Bivariate copula classes (as it starts with an underscore), and it is not implemented in all the Bivariate subclasses.

In order to fix this, the method should be moved to the Bivariate class and renamed to partial_derivative_scalar or similar.

Changes in API

Right now, bivariate and univariate copulas have methods to compute the different probability functions. However, these methods return another function that is called later.

We should change this behavior so that they return the actual result values instead of a function.

The list of changes to make is:

  1. Rename all functions with descriptive names instead of acronyms.
  2. Make the probability functions return values instead of a function.
  3. Make the probability functions not require arguments that can be taken from self.
  4. Unify types for input and output values, making all classes only accept and return np.ndarray.
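
To make the difference concrete, here is a sketch of the two styles for a uniform distribution. ClosureStyle, ValueStyle and their method names are hypothetical and only mirror the pattern described in this issue:

```python
import numpy as np


class ClosureStyle:
    """Hypothetical 'before' style: the method returns a function."""

    def __init__(self, low, high):
        self.low, self.high = low, high

    def get_cdf(self):
        def cdf(x):
            return np.clip((x - self.low) / (self.high - self.low), 0.0, 1.0)
        return cdf


class ValueStyle:
    """Hypothetical 'after' style: the method returns the values directly."""

    def __init__(self, low, high):
        self.low, self.high = low, high

    def cumulative_distribution(self, X):
        # Parameters come from self; input and output are np.ndarray.
        X = np.asarray(X, dtype=float)
        return np.clip((X - self.low) / (self.high - self.low), 0.0, 1.0)


X = np.array([0.0, 0.5, 2.0])
before = ClosureStyle(0.0, 1.0).get_cdf()(X)
after = ValueStyle(0.0, 1.0).cumulative_distribution(X)
```

Both calls produce the same numbers; the second form removes the extra call and gives the method a descriptive name.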

Add unit test for sample generation methods

Description

The behavior of the sampling methods needs to be thoroughly tested. The goal is to verify that the data used to fit the model and the samples generated from the model come from roughly the same distribution. This is tricky, since the sampling method is random by definition. Some possible approaches are:

  1. For the Bivariate class, compare the mean, variance, tail distribution, etc. There are also implementations in Matlab to cross-check against.
  2. For the Multivariate class, assuming the algorithm for building the model is correct, generate samples and then use get_likelihood() to compute their likelihood and verify that it is reasonable.
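
A sketch of the moment-comparison idea from the first point, using a plain Gaussian in place of a real copula model. moments_match and its 10% tolerance are assumptions made for this example:

```python
import numpy as np


def moments_match(real, synthetic, rel_tol=0.1):
    """Check that mean and standard deviation agree within a relative tolerance."""
    real, synthetic = np.asarray(real), np.asarray(synthetic)
    scale = real.std()
    mean_ok = abs(real.mean() - synthetic.mean()) <= rel_tol * scale
    std_ok = abs(real.std() - synthetic.std()) <= rel_tol * scale
    return bool(mean_ok and std_ok)


rng = np.random.RandomState(0)  # fixed seed keeps the test deterministic
real = rng.normal(loc=3.0, scale=2.0, size=5000)

# "Fit" a Gaussian by estimating its parameters, then sample from it.
fitted_mean, fitted_std = real.mean(), real.std()
synthetic = rng.normal(loc=fitted_mean, scale=fitted_std, size=5000)

similar = moments_match(real, synthetic)
different = moments_match(real, synthetic + 5.0)
```

Fixing the seed and using a tolerance wide enough for the sample size are what keep a test on random output from failing intermittently.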

Add option to accept scalars

Currently, our implementations of the statistical functions of copulas expect and return numpy arrays. However, it could be useful for them to also accept and return scalar values.
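
One way this could be done without touching each implementation is a decorator that wraps the array-based methods. accepts_scalars and ToyUniform below are hypothetical names used for this sketch:

```python
import functools

import numpy as np


def accepts_scalars(function):
    """Let an array-based method also accept and return scalars (assumed helper)."""
    @functools.wraps(function)
    def wrapper(self, X, *args, **kwargs):
        scalar_input = np.isscalar(X)
        result = function(self, np.atleast_1d(X), *args, **kwargs)
        # Unwrap the single value only when the caller passed a scalar.
        return result[0] if scalar_input else result
    return wrapper


class ToyUniform:
    """Hypothetical distribution on [0, 1] used for illustration."""

    @accepts_scalars
    def cumulative_distribution(self, X):
        return np.clip(X, 0.0, 1.0)


dist = ToyUniform()
scalar_result = dist.cumulative_distribution(0.25)
array_result = dist.cumulative_distribution(np.array([-1.0, 0.25, 3.0]))
```

The wrapped method keeps working on arrays exactly as before, so existing callers are not affected.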

Matlab copulastat and copularnd equivalent

  • Copulas version: origin/master
  • Python version:
  • Operating System: Linux

Description

Not being a copulas expert, this is just a quick question more than an issue. May I use copulas to
fully emulate Matlab copulastat? That is:

r = copulastat('Gaussian',rho) returns the Kendall’s rank correlation, r, that corresponds to a Gaussian copula with linear correlation parameters rho.

And the same applies to copularnd:

u = copularnd('Gaussian',rho,n) returns n random vectors generated from a Gaussian copula with linear correlation parameters rho.

After searching, it looks like there are a couple of packages, Copulas and copulalib, to deal with
copulas in Python. Thus, before starting to work with one or the other, it would be good to have some feedback from the experts.

Thanks
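
For reference, both quantities can be sketched for the bivariate Gaussian case with numpy alone, independently of either package. The function names below are hypothetical and this is not the Copulas API; it relies on the standard relation tau = (2 / pi) * arcsin(rho) for the Gaussian copula:

```python
import math

import numpy as np


def gaussian_kendall_tau(rho):
    """Kendall's tau of a bivariate Gaussian copula: (2 / pi) * arcsin(rho)."""
    return 2.0 / math.pi * math.asin(rho)


def _standard_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def gaussian_copula_sample(rho, n, seed=None):
    """Draw n pairs with uniform margins and Gaussian dependence rho."""
    rng = np.random.RandomState(seed)
    covariance = [[1.0, rho], [rho, 1.0]]
    normals = rng.multivariate_normal([0.0, 0.0], covariance, size=n)
    # The standard normal CDF maps each margin to the unit interval.
    return np.vectorize(_standard_normal_cdf)(normals)


tau = gaussian_kendall_tau(0.5)
U = gaussian_copula_sample(0.5, n=1000, seed=0)
```

The first function plays the role of copulastat('Gaussian', rho) and the second that of copularnd('Gaussian', rho, n), restricted here to two dimensions.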

What I Did

Make KDEUnivariate accept arrays as arguments.

Currently, all distributions and copulas accept as arguments arrays, usually numpy.array, with the exception of copulas.univariate.KDEUnivariate. We should change this behavior to match the rest of the library.

Add ability to handle constant data to univariate classes

  • Copulas version: 0.2.1
  • Python version: 3.6.1
  • Operating System: Ubuntu 18.04.1 LTS

Description

I was trying to fit a copulas.univariate.kde.KDEUnivariate with an array of constant data. I expected it to work and to be able to sample data (although I was assuming that the sampled values would be constant too).

What I Did

import numpy as np

from copulas.univariate import KDEUnivariate

X = np.array([1, 1, 1, 1])
kde = KDEUnivariate()
kde.fit(X)

and got the following traceback:

<ipython-input-2-6d5d418eb1ce> in <module>
      5 X = np.array([1, 1, 1, 1])
      6 kde = KDEUnivariate()
----> 7 kde.fit(X)

~/Pythia/MIT/Copulas/copulas/univariate/kde.py in fit(self, X)
     27             raise ValueError("data cannot be empty")
     28 
---> 29         self.model = scipy.stats.gaussian_kde(X)
     30         self.fitted = True
     31 

~/.virtualenvs/copulas_mit/lib/python3.6/site-packages/scipy/stats/kde.py in __init__(self, dataset, bw_method, weights)
    206             self._neff = 1/sum(self._weights**2)
    207 
--> 208         self.set_bandwidth(bw_method=bw_method)
    209 
    210     def evaluate(self, points):

~/.virtualenvs/copulas_mit/lib/python3.6/site-packages/scipy/stats/kde.py in set_bandwidth(self, bw_method)
    538             raise ValueError(msg)
    539 
--> 540         self._compute_covariance()
    541 
    542     def _compute_covariance(self):

~/.virtualenvs/copulas_mit/lib/python3.6/site-packages/scipy/stats/kde.py in _compute_covariance(self)
    550                                                bias=False,
    551                                                aweights=self.weights))
--> 552             self._data_inv_cov = linalg.inv(self._data_covariance)
    553 
    554         self.covariance = self._data_covariance * self.factor**2

~/.virtualenvs/copulas_mit/lib/python3.6/site-packages/scipy/linalg/basic.py in inv(a, overwrite_a, check_finite)
    972         inv_a, info = getri(lu, piv, lwork=lwork, overwrite_lu=1)
    973     if info > 0:
--> 974         raise LinAlgError("singular matrix")
    975     if info < 0:
    976         raise ValueError('illegal value in %d-th argument of internal '

LinAlgError: singular matrix

There is an open issue (#57) to fix a workaround in copulas.univariate.gaussian.GaussianUnivariate that avoids this exact situation. Could we generalize its solution in copulas.univariate.base.Univariate to be able to model and sample constant data with all univariate distributions?
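
A generalized solution might look something like the sketch below, which detects constant input before the underlying model is fitted. ConstantAwareDistribution and ToyModel are hypothetical; the only assumption about the wrapped model is that it offers fit(X) and sample(n):

```python
import numpy as np


class ToyModel:
    """Stand-in model that resamples its training data."""

    def fit(self, X):
        self.data = X

    def sample(self, num_samples):
        return np.random.RandomState(0).choice(self.data, size=num_samples)


class ConstantAwareDistribution:
    """Hypothetical wrapper; `model_class` needs fit(X) and sample(n)."""

    def __init__(self, model_class):
        self.model_class = model_class
        self.constant_value = None
        self.model = None

    def fit(self, X):
        X = np.asarray(X)
        if len(X) == 0:
            raise ValueError('data cannot be empty')
        if np.all(X == X[0]):
            # Constant data: remember the value and skip the real model.
            self.constant_value = X[0]
        else:
            self.constant_value = None
            self.model = self.model_class()
            self.model.fit(X)

    def sample(self, num_samples):
        if self.constant_value is not None:
            return np.full(num_samples, self.constant_value)
        return self.model.sample(num_samples)


wrapper = ConstantAwareDistribution(ToyModel)
wrapper.fit(np.array([1, 1, 1, 1]))
samples = wrapper.sample(3)
```

With this shape, the singular covariance matrix is never computed for constant input, and non-constant input follows the normal path.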

Update README

Readme should be updated before the release for PyPI, with the following:

  1. A section documenting the release process. See BTB for reference.
  2. Add pip install copulas as default installation method.
  3. List of copulas we currently support
  4. Data input expectations : numerical, perfectly clean

Reorganize dependencies

Remove the requirements.txt and requirements_test.txt files and list the dependencies only in setup.py.

requirements_dev.txt should be kept but it should install the test requirements as .[test]

Separate Bivariate Copulas

Bivariate Copulas should be separated into one class each. Also, the copula selector class copulas.bivariate.bv_copula should use inheritance to select one or another, instead of if statements.

Fix DeprecationWarnings

When fitting a GaussianCopula the following warning is raised:

copulas/multivariate/GaussianCopula.py:64: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
means = [np.mean(res.iloc[:, i].as_matrix()) for i in range(n)]
numpy/lib/function_base.py:3103: RuntimeWarning: invalid value encountered in subtract
X -= avg[:, None]
copulas/multivariate/GaussianCopula.py:66: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
return (cov.as_matrix(), means, res)

Add from_theta and from_tau classmethods on Bivariate

Currently, the only supported behavior for Bivariate copulas is the following:

  1. Instantiate the class of the desired family.
instance = Bivariate('frank')
  2. Fit it with data, which sets the internal parameters tau and theta.
instance.fit(X)

assert instance.tau is not None
assert instance.theta is not None
  3. Use the instance methods to access the statistical functions of the copula family for the parameter theta computed in the instance.
instance.sample(5)
instance.cdf(W)

Considering that pdf, cdf, ppf and sample all use only the theta parameter, and that the tau parameter is only used to compute theta, we could add the following methods:

  • from_tau: A classmethod that receives the tau parameter, creates a new instance, computes and sets the theta parameter, marks the instance as fitted and returns it.

  • from_theta: A classmethod that receives the theta parameter, creates a new instance, sets the theta and fitted attributes and returns it.
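
A sketch of how the two classmethods could fit together, using the Clayton family as an example. ToyClayton is hypothetical and not the library's Bivariate class; the tau-to-theta conversion uses the standard Clayton relation tau = theta / (theta + 2):

```python
class ToyClayton:
    """Hypothetical Clayton copula; not the library's Bivariate class."""

    def __init__(self):
        self.tau = None
        self.theta = None
        self.fitted = False

    @staticmethod
    def compute_theta(tau):
        # Clayton family: tau = theta / (theta + 2), so theta = 2 * tau / (1 - tau).
        if tau == 1:
            return float('inf')
        return 2 * tau / (1 - tau)

    @classmethod
    def from_theta(cls, theta):
        instance = cls()
        instance.theta = theta
        instance.fitted = True
        return instance

    @classmethod
    def from_tau(cls, tau):
        # Reuse from_theta so both constructors set the same attributes.
        instance = cls.from_theta(cls.compute_theta(tau))
        instance.tau = tau
        return instance


copula = ToyClayton.from_tau(0.5)
```

Building from_tau on top of from_theta keeps the "mark as fitted" logic in a single place.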
