
pykelihood

About

Pykelihood is a Python package for statistical analysis that gives likelihood-based inference more flexibility than scipy.stats allows. Distributions are designed from an object-oriented programming (OOP) point of view.

Main features include:

  • use any scipy.stats distribution, or make your own,
  • fit distributions of arbitrary complexity to your data,
  • add trends of different forms in the parameters of any distribution,
  • condition the log-likelihood with any form of penalty,
  • profile parameters with a penalised log-likelihood,
  • more to come...

Installation

Using pip

pip install pykelihood

From sources

git clone https://www.github.com/OpheliaMiralles/pykelihood

or

gh repo clone OpheliaMiralles/pykelihood

Usage

Basics

The most basic thing you can use pykelihood for is creating and manipulating distributions as objects.

>>> from pykelihood.distributions import Normal
>>> n = Normal(1, 2)
>>> n
Normal(loc=1.0, scale=2.0)

n is an object of type Normal. It has 2 parameters, loc and scale. They can be accessed like standard Python attributes:

>>> n.loc
1.0

Using the Normal object, you can calculate standard values using the same semantics as scipy.stats:

>>> n.pdf([0, 1, 2])
array([0.17603266, 0.19947114, 0.17603266])
>>> n.cdf([0, 1, 2])
array([0.30853754, 0.5       , 0.69146246])

You can also generate random values according to this distribution:

>>> n.rvs(10)
array([ 3.31370986,  5.02699468, -0.3573229 ,  1.00460378, -3.26044871,
        1.86362711, -0.84192901,  0.81132182, -2.03266978,  1.48079944])

Fitting

Let's generate a larger sample from our previous object:

>>> data = n.rvs(1000)
>>> data.mean()
1.025039359276458
>>> data.std()
1.9376460645596842

We can fit a Normal distribution to this data, which will return another Normal object:

>>> Normal.fit(data)
Normal(loc=1.0250822420920338, scale=1.9376400770300832)

As you can see, the values are slightly different from the moments in the data. This is due to the fact that the fit method returns the Maximum Likelihood Estimator (MLE) for the data, and is thus the result of an optimisation (using scipy.optimize).
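
For comparison, the MLE of a Normal distribution has a closed form, which plain scipy.stats uses directly; the small differences above come only from the numerical optimiser (this snippet uses scipy, not pykelihood):

>>> from scipy.stats import norm
>>> norm.fit(data)  # returns exactly (data.mean(), data.std())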

We can also fix the value for some parameters if we know them:

>>> Normal.fit(data, loc=1)
Normal(loc=1.0, scale=1.9377929687500024)
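
scipy.stats offers the same facility through its f-prefixed keyword arguments, for comparison:

>>> norm.fit(data, floc=1)  # fixes loc at 1 and fits only the scale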

Trend fitting

One of the most powerful features of pykelihood is the ability to fit distributions of arbitrary complexity. For instance, say our data has a linear trend with very little Gaussian noise, which we would like to capture:

>>> import numpy as np
>>> data = np.linspace(-1, 1, 365) + np.random.normal(0, 0.001, 365)
>>> data[:10]
array([-0.99802364, -0.99503679, -0.98900434, -0.98277981, -0.979487  ,
       -0.97393519, -0.96853445, -0.96149152, -0.95564004, -0.95054887])

If we try to fit this without a trend, the resulting distribution will miss out on most of the information:

>>> Normal.fit(data)
Normal(loc=-3.6462053656578005e-05, scale=0.5789668679237372)

Let's fit a Normal distribution with a trend in the loc parameter:

>>> from pykelihood import kernels
>>> Normal.fit(data, loc=kernels.linear(np.arange(365)))
Normal(loc=linear(a=-1.0000458359290572, b=0.005494714384381866), scale=0.0010055323717468906)

kernels.linear(X) builds a linear model of the form a + bX, where a and b are parameters to be optimised for and X is some covariate used to fit the data. If we assume the data were daily observations, then we find all the values we expected: -1 was the value on the first day, the daily increment was about 0.0055 (2 / 365 ≈ 0.0055), and the noise had a standard deviation of 0.001.
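
As a quick check, and assuming that calling a parameter evaluates its kernel (the profiling example and the issue snippets below use the same pattern), the fitted trend can be reconstructed and compared with the data; this is a sketch, not documented API:

>>> fitted = Normal.fit(data, loc=kernels.linear(np.arange(365)))
>>> trend = fitted.loc()  # evaluates a + b * X over the covariate
>>> np.abs(trend - data).max()  # residuals are on the order of the 0.001 noise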

Fitting with penalties

Another useful feature of pykelihood is the ability to customize the log-likelihood function with penalties, conditioning methods, stability conditions, etc. Most statistics-related packages fit data with the standard negative log-likelihood or, at best, a preselected set of models. To our knowledge, pykelihood is the only Python package that makes it easy to customize the log-likelihood function used to fit the data.

>>> data = np.random.normal(0, 1, 1000)
>>> def lassolike_score(distribution, data):
...     return -np.sum(distribution.logpdf(data)) + 5 * np.abs(distribution.loc())
...
>>> std_fit = Normal.fit(data)
>>> cond_fit = Normal.fit(data, score=lassolike_score)
>>> std_fit.loc.value
-0.010891307380632494
>>> cond_fit.loc.value
-0.006210406541824357
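
The score argument accepts any callable taking the candidate distribution and the data; for instance, a quadratic (ridge-like) penalty on loc is a straightforward variant of the snippet above (the weight 5 is arbitrary):

>>> def ridgelike_score(distribution, data):
...     return -np.sum(distribution.logpdf(data)) + 5 * distribution.loc() ** 2
...
>>> ridge_fit = Normal.fit(data, score=ridgelike_score)
>>> ridge_fit.loc.value  # shrunk towards 0, like the lasso-style fit above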

Parameter profiling

Likelihood-based inference relies on parameter estimation, so it is important to quantify the sensitivity of a chosen model to each of those parameters. The profiler module in pykelihood includes the Profiler class, which links a model to a set of observations by providing goodness-of-fit metrics and profiles for all parameters.

>>> from pykelihood.profiler import Profiler
>>> from pykelihood.distributions import GEV
>>> fitted_gev = GEV.fit(data, loc=kernels.linear(np.linspace(-1, 1, len(data))))
>>> ll = Profiler(fitted_gev, data, inference_confidence=0.99) # level of confidence for tests
>>> ll.AIC  # the standard fit is without trend
{'AIC MLE': -359.73533182968777, 'AIC Standard MLE Fit': 623.9896838880583}
>>> ll.profiles.keys()
dict_keys(['loc_a', 'loc_b', 'scale', 'shape'])
>>> ll.profiles["shape"].head(5)
      loc_a     loc_b     scale     shape   likelihood
0 -0.000122  1.000812  0.002495 -0.866884  1815.022132
1 -0.000196  1.000795  0.001964 -0.662803  1882.043541
2 -0.000283  1.000477  0.001469 -0.458721  1954.283256
3 -0.000439  1.000012  0.000987 -0.254640  2009.740282
4 -0.000555  1.000016  0.000948 -0.050558  1992.812843

Confidence intervals can be computed for specified metrics:

>>> def metric(gev): return gev.loc()
>>> ll.confidence_interval(metric)
[-4.160287666875364, 4.7039931595123825]
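
Any callable mapping the fitted distribution to a scalar can serve as a metric; for example, an interval for the scale parameter follows the same pattern (a sketch, assuming scale() evaluates like loc() above):

>>> ll.confidence_interval(lambda gev: gev.scale())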

Contributing

Poetry is used to manage pykelihood's dependencies and build system. To install Poetry, refer to its installation instructions; it boils down to running:

curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python

To configure your environment to work on pykelihood, run:

git clone https://www.github.com/OpheliaMiralles/pykelihood  # or any other clone method
cd pykelihood
poetry install

This will create a virtual environment for the project and install the required dependencies. To activate the virtual environment, be sure to run poetry shell prior to executing any code.

We also use the pre-commit library, which adds git hooks to the repository. These must be installed with:

pre-commit install

Some parts of the code base use the matplotlib and hawkeslib packages, but these are for now not required to run most of the code, including the tests.

Tests

Tests are run using pytest. To run all tests, navigate to the root folder or the tests folder and type pytest.
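
For example, from the repository root, inside the Poetry environment set up above:

poetry shell
pytest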


pykelihood's Issues

Creating a distribution with constant trend parameters does not work

Creating a distribution with constant trend parameters does not work, due to an incompatibility between the covariate x type and ConstantParameter.

    """
    Making trend parameters constant somehow results in an error
    """
    x = np.array(np.random.uniform(size=200))
    y = np.array(np.random.normal(size=200))
    n = Normal.fit(y, loc=linear(x=x))
    #  now we create the same distribution in which the trend in the loc is fixed
    n1 = Normal(loc=linear(x=x, a=Parameter(n.loc.a), b=Parameter(n.loc.b)))
    n1.loc() # this line raises an error
    n2 = Normal(loc=linear(x=x, a=ConstantParameter(n.loc.a), b=ConstantParameter(n.loc.b)))
    n2.loc() # this line raises an error

When a parameter common to 2 different trends is fixed for one trend, it should be automatically fixed for the other.

"""
when 2 trends in the distribution parameters share a common parameter, e.g. alpha in the below example, making one of the corresponding trend parameter constant should automatically result in the other trend parameter is constant.
"""
x = np.array(np.random.uniform(size=200))
y = np.array(np.random.normal(size=200))
alpha0_init = 0.
alpha = pkl.parameters.Parameter(alpha0_init)
n = Normal.fit(y, loc=linear(x=x, b=alpha), scale=linear(x=x, b=alpha))
fixed_alpha = ConstantParameter(n.loc.b)  # should be equal to fit.scale.a as per problem1
fit_with_fixed_alpha = n.fit_instance(data=y, fixed_values={'loc_b': fixed_alpha})
assert(isinstance(fit_with_fixed_alpha.scale.b, ConstantParameter))
assert(fit_with_fixed_alpha.scale.b.value==fixed_alpha)

Use rvs from scipy directly if available

rvs is singled out from all the other methods we inherit from scipy in distributions.py; let's change that so that we can use scipy's random_state parameter and avoid code duplication.
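
For reference, this is the scipy behaviour the change would expose, shown here with plain scipy.stats:

>>> from scipy.stats import norm
>>> norm(1, 2).rvs(size=5, random_state=42)  # reproducible draws via random_state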

Shared parameters for different trends should be calibrated as one single parameter

Shared parameters in trends for the distribution parameters should be optimized as one single parameter.

"""
when 2 trends in the distribution parameters share a common parameter, e.g. alpha in the below example, the fit results in
different values for the trends parameters that should be equal.
"""
x = np.array(np.random.uniform(size=200))
y = np.array(np.random.normal(size=200))
alpha0_init = 0.
alpha = pkl.parameters.Parameter(alpha0_init)
n = Normal.fit(y, loc=linear(x=x, b=alpha), scale=linear(x=x, b=alpha))
alpha1 = n.loc.b
alpha2 = n.scale.b
assert (alpha1 == alpha2)

Create wrapper for copulae Copulas

A milestone in fitting and profiling multivariate distributions would be the ability to create a Multivariate distribution initialised with a correlation structure (e.g. a copula), the number of dimensions ndim, and the ndim marginal distributions. We should be able to manage the fit of the composed likelihood function and compute the profile likelihood with the flattened vector of parameters.

process_fit_params does not work on multiple layers

For complicated cases with kernels inside of kernels, or kernels inside a distribution parameter (e.g. a Truncated distribution whose underlying distribution has a linear kernel in the loc parameter), the function does not work properly.
