bracketjohn / kerndisc

Automatic Kernel Discovery for Gaussian Processes
This also enables us to do textual description.
current_models = [model.simplified() for model in current_models]
current_models = ff.remove_duplicates(current_models)
with simplified being defined as:
def simplified(self):
    k = self.copy()
    k_prev = None
    while k_prev != k:
        k_prev = k.copy()
        k = k.collapse_additive_idempotency()
        k = k.collapse_multiplicative_idempotency()
        k = k.collapse_multiplicative_identity()
        k = k.collapse_multiplicative_zero()
        k = k.canonical()
    return k
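The loop above runs the collapse passes to a fixed point: it repeats until a full pass leaves the expression unchanged. A minimal sketch of that idea on a toy tuple encoding (the `('sum', ...)` representation and the single rule below are invented for illustration and are not kernDisc's real kernel classes):

```python
# Toy fixed-point simplification, mirroring the shape of simplified().
# Expressions are nested tuples like ('sum', 'rbf', 'rbf'); the rule
# name mirrors one of the collapse_* methods.

def collapse_additive_idempotency(expr):
    """Drop duplicate summands: ('sum', a, a) -> a."""
    if isinstance(expr, tuple) and expr and expr[0] == 'sum':
        seen, children = set(), []
        for child in expr[1:]:
            if child not in seen:
                seen.add(child)
                children.append(child)
        return children[0] if len(children) == 1 else ('sum', *children)
    return expr

def simplified(expr):
    prev = None
    while prev != expr:  # fixed point: stop once a pass changes nothing
        prev = expr
        expr = collapse_additive_idempotency(expr)
    return expr
```

For example, `simplified(('sum', 'rbf', 'rbf'))` collapses to `'rbf'`, while `('sum', 'rbf', 'white')` is already a fixed point and is returned unchanged.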
Research issue to check whether this might actually be a good idea
Adding new kernels will currently break tests, as all base kernels are tested with grammar duvenaud, even though it only supports a subset.

Multiple RBF kernels are currently already being merged, while multiple constant kernels are being ignored. This hampers performance, but shouldn't change the outcome.
Is available in the grammar string definition, but not in _IMPLEMENTED_KERNEL_EXPRESSIONS.
Currently a user has to add a new file to the grammars package and modify existing source code in order to create and use a new grammar. This is not really "library like" and should be easier. Instead, some method in kernDisc should be able to accept a user-supplied function that is then used for expansion.
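One way this could look is an entry point that takes the expansion step as a plain callable, so a custom grammar is just another function. A hedged sketch: the names `discover`, `expand_fn` and the tuple expressions below are illustrative placeholders, not the current kernDisc API.

```python
# Sketch: discover() parameterised over the grammar's expansion function,
# instead of requiring a new module in the grammars package.

def default_expand(expression):
    """Toy stand-in for a grammar: extend an expression by one base kernel."""
    return [expression + ('rbf',), expression + ('white',)]

def discover(data, expand_fn=default_expand, max_depth=2):
    """Enumerate candidate expressions by repeatedly applying expand_fn."""
    frontier = [()]
    for _ in range(max_depth):
        frontier = [child for expr in frontier for child in expand_fn(expr)]
    return frontier
```

A user-defined grammar then needs no changes to library source: `discover(data, expand_fn=my_grammar)`.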
max_depth
no_improvment
AS implements the ABCD framework for textual description/explanation of the kernels it generates. We also need this for simplified kernel expressions.
Relies on #11
Add an explanation package to explain and describe kernels.

Graphing is often necessary to see what's going on and to manually tune search. Currently this has to be done manually every time; a feature for this might be nice.
kerndisc repository

kernDisc construction into sub tasks (automated-statistician)
def add_random_restarts(models, n_rand=1, sd=4, data_shape=None):
    new_models = []
    for a_model in models:
        for (kernel, likelihood, mean) in zip(
                add_random_restarts_single_k(a_model.kernel, n_rand=n_rand, sd=sd, data_shape=data_shape),
                add_random_restarts_single_l(a_model.likelihood, n_rand=n_rand, sd=sd, data_shape=data_shape),
                add_random_restarts_single_m(a_model.mean, n_rand=n_rand, sd=sd, data_shape=data_shape)):
            new_model = a_model.copy()
            new_model.kernel = kernel
            new_model.likelihood = likelihood
            new_model.mean = mean
            new_models.append(new_model)
    return new_models
AS does the following:
import numpy as np

def add_jitter_k(kernels, sd=0.1):
    '''Adds random noise to all parameters - empirically observed to help when optimiser gets stuck.'''
    for k in kernels:
        k.load_param_vector(k.param_vector + np.random.normal(loc=0., scale=sd, size=k.param_vector.size))
    return kernels

def add_jitter(models, sd=0.1):
    for a_model in models:
        a_model.kernel = add_jitter_k([a_model.kernel], sd=sd)[0]
    return models
This currently does not hold:
def test_simplify_order(are_asts_equal):
    """Test whether order is also irrelevant."""
    ast_one = Node(gpflow.kernels.Sum, full_name='Sum')
    Node(gpflow.kernels.RBF, parent=ast_one, full_name='rbf')
    Node(gpflow.kernels.Constant, parent=ast_one, full_name='constant')

    ast_two = Node(gpflow.kernels.Sum, full_name='Sum')
    Node(gpflow.kernels.Constant, parent=ast_two, full_name='constant')
    Node(gpflow.kernels.RBF, parent=ast_two, full_name='rbf')

    simpl_one = simplify(ast_one)
    simpl_two = simplify(ast_two)

    assert are_asts_equal(simpl_one, simpl_two)
To fix this:
if lvl_order_ast_one != lvl_order_ast_two:
    return False
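One way to make the comparison order-insensitive is to normalise children of commutative nodes into a canonical order before comparing. A minimal sketch on a toy `(name, children)` tree representation (invented here; not kerndisc's actual `Node` class or `are_asts_equal` fixture):

```python
# Order-insensitive AST equality via canonical child ordering.
# Trees are (name, [children...]) tuples for this illustration.

def canonical(node):
    """Return a hashable, order-normalised form of a (name, children) tree."""
    name, children = node
    return (name, tuple(sorted(canonical(c) for c in children)))

def are_asts_equal(a, b):
    return canonical(a) == canonical(b)

sum_one = ('Sum', [('rbf', []), ('constant', [])])
sum_two = ('Sum', [('constant', []), ('rbf', [])])
```

With this, `are_asts_equal(sum_one, sum_two)` holds even though the children were added in different orders.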
It might be better to use a Bayesian approach in combination with auto diff.
During search, the TensorFlow graph bloats to thousands of variables. This happens because the default graph is never cleaned/reset, and it immensely reduces search speed over time.
The new gpflow version changed its kernel hierarchy. Now a Sum or Product kernel has a kernels child, instead of carrying all children that are part of said Sum or Product directly. kernels can then be used to get the children. This is breaking for kernDisc and can currently not be supported.
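A possible bridge, sketched below, is a small accessor that works against both layouts. The toy classes only mimic the two shapes (direct child attributes vs. a `kernels` list); they are not gpflow classes, and the attribute name `children` for the old layout is an assumption for illustration.

```python
# Hypothetical compatibility shim for traversing combination kernels
# across the old and new hierarchies.

class OldSum:                       # old-style layout: children stored directly
    def __init__(self, *parts):
        self.children = list(parts)

class NewSum:                       # new-style layout: children behind `.kernels`
    def __init__(self, *parts):
        self.kernels = list(parts)

def child_kernels(kernel):
    """Return sub-kernels regardless of which hierarchy is in use."""
    if hasattr(kernel, 'kernels'):  # new gpflow-style layout
        return list(kernel.kernels)
    return list(getattr(kernel, 'children', []))
```

Leaf kernels (no children in either layout) simply yield an empty list, so recursive traversal code stays uniform.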
There are some things that occurred during development and evaluation that might be improvements on the grammar of Duvenaud et al. These improvements also don't seem to be covered in Lloyd et al. It might be good to implement them in a modified version of Duvenaud's grammar, e.g. _grammar_duvenaud_modified.py. Some small changes that move away from Duvenaud's design were already made; these must then also be moved there.
For example, the grammar returning constant or white as the "best" result of discover might be unreasonable.

TODO: Add more improvements noticed during development here.
[0, 1]-like prior to oob optimization

It seems to have an impact on a time series if it is very sparse and also stretches over a large time interval. Some of these examples include 39 data points over a duration of about 45k minutes (~75 h).
This leads to kerndisc tending to find constant and white kernels (as an end result) to deal with this.

To tackle this behavior, kerndisc should be able to rescale X to some interval during preprocessing. This interval should still keep the same relative distance between points.
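The proposed preprocessing step could be sketched as an affine rescaling onto [0, n-1] that preserves relative spacing. The function name `rescale_x` is illustrative, not kerndisc API:

```python
# Sketch: map xs onto [0, len(xs) - 1] while preserving relative distances.

def rescale_x(xs):
    """Affinely rescale xs onto [0, len(xs) - 1]."""
    lo, hi = min(xs), max(xs)
    span = hi - lo
    if span == 0:                  # degenerate case: all points coincide
        return [0.0 for _ in xs]
    scale = (len(xs) - 1) / span
    return [(x - lo) * scale for x in xs]
```

Because the map is affine, ratios of distances between points are unchanged: 45k minutes of evenly spaced observations become 0, 1, ..., n-1, and unevenly spaced points keep their spacing ratios.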
(E.g. [0, ..., 49] for n=49 data points.)

Currently describe returns its description in an arbitrary order for all sub components. This should be improved to take into account some kind of metric. Examples of this include:
Currently there is a kind of duality in the naming used internally: gpflow has its own naming scheme, which capitalizes kernel names. However, they go as far as also using camel case (ArcCosine) and upper case (RBF). Currently this is abstracted in kerndisc by the _kernels module, which maintains a dict of kernel_name.lower(): kernel_class mappings. This is useful for general coding, but also for grammar definition: _grammar_duvenaud allows kernels to be excluded from being a base_kernel. These can be excluded by passing:

grammar_kwargs={'base_kernels_to_exclude': ['constant']}

as a parameter to discover. Here it is easier for a user to just use lower case kernel names than to remember all the different casing options. Whether this justifies deviation from gpflow is questionable, though.
The cp and cw kernels are currently accessible as SPECIAL_KERNELS, which doesn't make a lot of sense. They are closer to structural kernels or something similar.
I'm not sure why this error occurs; the gpflow installation itself has no issues.
gpflow version: 2.0.3
tensorflow version: 2.2.0
There seem to be lazy_load errors and deprecation warnings after a fresh installation right now. This is most likely fixed by addressing #44. Will wait for more people to experience this before fixing, as this library is in low maintenance mode right now.
This epic can be closed once kerndisc is ready for alpha. Closing it will also mark the transition of this project from an internal, self-managed storyboard to an open-source issue/participation style system.
It seems like AS never really used this functionality, as best_models is never set to anything other than None.
if best_models is not None:
    for a_model in best_models:
        current_models = (current_models
                          + [a_model.copy()]
                          + ff.add_jitter_to_models([a_model.copy() for dummy in range(exp.n_rand)], exp.jitter_sd))
Lloyd et al. furthered kernel describability in their paper. Read the relevant sections and alter the kernels accordingly.
Every sub-function of simplify currently uses Python's deepcopy in order to return an actual new object. It should be discussed whether this is necessary, or whether hiding these lower levels of the simplification API should be taken to its full extent. This would mean creating a single deep copy in simplify and working on this copy afterwards.
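The single-copy alternative can be sketched as follows: simplify copies once up front, and the lower-level passes mutate that copy in place instead of each returning their own deepcopy. The dict-based AST and the pass function are illustrative placeholders, not kerndisc internals:

```python
# Sketch: one deepcopy at the simplify() boundary; passes mutate in place.

from copy import deepcopy

def _collapse_idempotency(node):
    """Mutating pass: dedupe children in place, preserving order."""
    node['children'] = list(dict.fromkeys(node['children']))

def simplify(ast):
    ast = deepcopy(ast)            # the only copy in the whole pipeline
    _collapse_idempotency(ast)
    return ast
```

The caller's AST stays untouched, while the internal passes avoid repeated copying.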
Ugly stuff: gpflow