
novonordisk-research / ProcessOptimizer

36 stars · 7 watchers · 12 forks · 41.32 MB

A tool to optimize real world problems

Home Page: https://github.com/novonordisk-research/ProcessOptimizer

License: Other

Languages: Python 36.21%, Jupyter Notebook 63.79%
Topics: bayesianoptimization, optimization

processoptimizer's People

Contributors

abbl-dti, akselobdrup, benricon, betatim, carlosdanielcsantos, chschroeder, cmmalone, cschell, darenr, dependabot[bot], dk-teknologisk-cbb, dk-teknologisk-mon, glouppe, hvass-labs, iaroslav-ai, jkleint, jusjusjus, kejiashi, mechcoder, mirca, mp4096, nel215, nfcampos, scott-graham-bose, sigurdcarlsen, sqbl, srfu-nn, stefanocereda, thomasjpfan, yngtodd

processoptimizer's Issues

Different versions

Investigate options for having long-term-stable releases (with pinned versions of dependencies) and a cutting-edge version (with unpinned dependencies).

Speed-up of convergence/metrics plotting

Right now plot_expected_minimum_convergence needs to "replay" the full optimization in order to produce convergence plots.
Extracting the expected minima and their uncertainties requires knowing the state of the model at each step of the optimization process.
This information could also be taken from the models attribute of the optimizer object, which is extended on each call of tell(). Right now expected_minimum() only uses the latest model from that attribute.
Therefore expected_minimum() could be adapted accordingly, or a separate helper function could be introduced.

I see that this approach fails as soon as multiple points are passed to tell() at once. For that case, one could store how many points were handed to tell() in each call and only replay the optimization where it is actually necessary.
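
A minimal sketch of such a helper, reusing the models stored on the optimizer instead of replaying the optimization. The helper name is hypothetical, and it assumes the i-th stored model was fit on the first i + 1 observations; in practice one would need to offset by n_initial_points and account for batched tell() calls (the complication discussed above).

from ProcessOptimizer import expected_minimum
from ProcessOptimizer.utils import create_result

def expected_minimum_per_iteration(opt):
    # Hypothetical helper: one expected minimum per stored surrogate model.
    minima = []
    for i, model in enumerate(opt.models):
        n_points = i + 1  # assumption: one single-point tell() per stored model
        partial = create_result(opt.Xi[:n_points], opt.yi[:n_points],
                                opt.space, opt.rng, models=[model])
        minima.append(expected_minimum(partial))
    return minima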

Update Python dependencies

We should stop supporting Python 3.6 (and perhaps also 3.7) and instead support Python 3.10 (and, soon-ish, 3.11).

Merging the arguments strategy and space_fill in the ask() function

The ask() function currently has two arguments that can be merged:

"strategy" determines which strategy is used to sample points after the first point is chosen with the acquisition function.

"space_fill" is used when the user eg. exclusively wants Steinerberger points (and no points using the acqusition function).

Instead, a solution is proposed where strategy can take the two values "stbr_fill" and "stbr_full". With the "stbr_fill" strategy, Steinerberger points are used after the first point has been chosen. With the "stbr_full" strategy, all points are chosen using the Steinerberger strategy.
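
A sketch of how the merged argument might be used; the two strategy values are the proposal above, not an implemented API:

import ProcessOptimizer as po

opt = po.Optimizer([(0.0, 1.0), (0.0, 1.0)], n_initial_points=4)

# Proposed behaviour: "strategy" absorbs the old "space_fill" argument.
batch_fill = opt.ask(n_points=8, strategy="stbr_fill")  # acquisition point first, Steinerberger fill afterwards
batch_full = opt.ask(n_points=8, strategy="stbr_full")  # every point chosen by the Steinerberger strategy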

Other type of error in tests (still relating to multiobjective)

One family of test errors relates specifically to the use of EIps and PIps. These are tracked (and hopefully solved) in issue #18.
This issue tracks a related error. One such example from pytest is:

_____________ test_minimizer_api_dummy_minimize[call_single-True] _____________

verbose = True, call = <function call_single at 0x00000200E14054C0>

@pytest.mark.fast_test
@pytest.mark.parametrize("verbose", [True, False])
@pytest.mark.parametrize("call",
                         [call_single, [call_single, check_result_callable]])
def test_minimizer_api_dummy_minimize(verbose, call):
    # dummy_minimize is special as it does not support all parameters
    # and does not fit any models
    n_calls = 7
    result = dummy_minimize(branin, [(-5.0, 10.0), (0.0, 15.0)],
                            n_calls=n_calls, random_state=1,
                            verbose=verbose, callback=call)

    assert result.models == []
    check_minimizer_api(result, n_calls)
    check_minimizer_bounds(result, n_calls)
    with pytest.raises(ValueError):
      dummy_minimize(lambda x: x, [[-5, 10]])

ProcessOptimizer\tests\test_common.py:106:


ProcessOptimizer\optimizer\dummy.py:92: in dummy_minimize
return base_minimize(func, dimensions, base_estimator="dummy",
ProcessOptimizer\optimizer\base.py:261: in base_minimize
result = optimizer.tell(next_x, next_y)
ProcessOptimizer\optimizer\optimizer.py:558: in tell
return self._tell(x, y, fit=fit)
ProcessOptimizer\optimizer\optimizer.py:743: in _tell
return create_result(self.Xi, self.yi, self.space, self.rng,


Xi = [[-5]], yi = array([[-5]], dtype=int64)
space = Space([Integer(low=-5, high=10)])
rng = RandomState(MT19937) at 0x200C7E12740, specs = None, models = []

def create_result(Xi, yi, space=None, rng=None, specs=None, models=None):
    """
    Initialize an `OptimizeResult` object.

    Parameters
    ----------
    * `Xi` [list of lists, shape=(n_iters, n_features)]:
        Location of the minimum at every iteration.

    * `yi` [array-like, shape=(n_iters,)]:
        Minimum value obtained at every iteration.

    * `space` [Space instance, optional]:
        Search space.

    * `rng` [RandomState instance, optional]:
        State of the random state.

    * `specs` [dict, optional]:
        Call specifications.

    * `models` [list, optional]:
        List of fit surrogate models.

    Returns
    -------
    * `res` [`OptimizeResult`, scipy object]:
        OptimizeResult instance with the required information.
    """
    res = OptimizeResult()
    yi = np.asarray(yi)
    if np.ndim(yi) == 2:
      res.log_time = np.ravel(yi[:, 1])

E IndexError: index 1 is out of bounds for axis 1 with size 1

ProcessOptimizer\utils.py:60: IndexError
---------------------------- Captured stdout call -----------------------------
Iteration No: 1 started. Evaluating function at random point.
Iteration No: 1 ended. Evaluation done at random point.
Time taken: 0.0010
Function value obtained: 123.3261
Current minimum: 123.3261
Iteration No: 2 started. Evaluating function at random point.
Iteration No: 2 ended. Evaluation done at random point.
Time taken: 0.0000
Function value obtained: 8.6117
Current minimum: 8.6117
Iteration No: 3 started. Evaluating function at random point.
Iteration No: 3 ended. Evaluation done at random point.
Time taken: 0.0000
Function value obtained: 18.0847
Current minimum: 8.6117
Iteration No: 4 started. Evaluating function at random point.
Iteration No: 4 ended. Evaluation done at random point.
Time taken: 0.0000
Function value obtained: 44.2546
Current minimum: 8.6117
Iteration No: 5 started. Evaluating function at random point.
Iteration No: 5 ended. Evaluation done at random point.
Time taken: 0.0010
Function value obtained: 112.0112
Current minimum: 8.6117
Iteration No: 6 started. Evaluating function at random point.
Iteration No: 6 ended. Evaluation done at random point.
Time taken: 0.0000
Function value obtained: 21.0686
Current minimum: 8.6117
Iteration No: 7 started. Evaluating function at random point.
Iteration No: 7 ended. Evaluation done at random point.
Time taken: 0.0000
Function value obtained: 9.3015
Current minimum: 8.6117

Sorry for the ugly formatting. In short, I believe the if-statement at line 59 in utils.py does not correctly handle the case where y is multidimensional.

Suggestions for fix?
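
A hedged sketch of one possible guard (the helper name is made up, and the surrounding create_result code may differ): only interpret a 2-D yi as (value, log_time) pairs when it actually has two columns, so multiobjective or single-column input is left untouched.

import numpy as np

def _split_objective_and_time(yi):
    # Hypothetical helper mirroring the logic around line 59 in utils.py.
    yi = np.asarray(yi)
    log_time = None
    if yi.ndim == 2 and yi.shape[1] == 2:
        log_time = np.ravel(yi[:, 1])  # second column assumed to hold evaluation times
        yi = np.ravel(yi[:, 0])
    return yi, log_time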

2D-partial dependence plots to use same color scale

By default, the 2D-partial dependence plots each use their own color scale, going from the lowest to the highest value that occurs in the plot. This is potentially misleading to the end user, as all factors will appear to have a significant effect when in practice this is not the case (as one might surmise from the 1D-partial dependence plots that are shown alongside).
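
A minimal matplotlib sketch of the idea (not the plot_objective implementation): compute one global value range across all 2D partial dependence surfaces and use the same contour levels in every panel, so all panels share a single color scale. The grids argument is a hypothetical list of precomputed (xi, yi, zi) surfaces.

import numpy as np
import matplotlib.pyplot as plt

def plot_2d_shared_scale(grids):
    vmin = min(float(np.min(zi)) for _, _, zi in grids)
    vmax = max(float(np.max(zi)) for _, _, zi in grids)
    levels = np.linspace(vmin, vmax, 20)  # identical levels for every panel
    fig, axes = plt.subplots(1, len(grids), figsize=(4 * len(grids), 3), squeeze=False)
    for ax, (xi, yi, zi) in zip(axes[0], grids):
        cm = ax.contourf(xi, yi, zi, levels=levels)
    fig.colorbar(cm, ax=axes.ravel().tolist())
    return fig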

More Flake8 work

We need to keep working on removing Flake8 errors. Finally, Flake8 could/should be made part of GitHub Actions for automated code review.

missing dependency pyYAML

When doing a fresh install of ProcessOptimizer using Python 3.9.2, all test cases fail due to the missing dependency pyYAML.

Alternatives for the convergence plot

Deciding when to stop the optimization process is a hard problem that involves a trade-off between the added cost of doing more experiments and the (perceived) rate of improvement per experiment. At present, the "convergence plot" that displays the best observation vs. experiment number is not really a good tool to aid in this process, among other things because it does not indicate how certain we are about the present minimum.

An alternative plot to aid the user could be to display the value of the expected minimum and the confidence interval of the value as a function of experiment number. As one covers more and more of parameter space with experiments, the expected minimum will (hopefully) improve, while at the same time being somewhat uncertain. If the global optimum is found, then as more and more experiments are run the expected minimum will stop improving, while the uncertainty of the minimum value will decrease.

Another possible diagnostic plot could be to plot how much the expected minimum is moving (using some normalized metric) as a function of experiment number. Again, when a parameter space is sparsely sampled, the expected minimum is going to move a lot at first, and then (hopefully) "settle down" into a minimum.
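
A minimal matplotlib sketch of the first proposal (expected minimum value with a confidence band versus experiment number). The input arrays are hypothetical and would have to be produced elsewhere, e.g. from the stored surrogate models:

import matplotlib.pyplot as plt

def plot_expected_minimum_vs_iteration(exp_no, exp_min, exp_min_std):
    # exp_no: experiment numbers; exp_min / exp_min_std: expected minimum value
    # and its standard deviation after each experiment (hypothetical inputs).
    fig, ax = plt.subplots()
    ax.plot(exp_no, exp_min, marker="o", label="expected minimum")
    ax.fill_between(exp_no,
                    [m - 1.96 * s for m, s in zip(exp_min, exp_min_std)],
                    [m + 1.96 * s for m, s in zip(exp_min, exp_min_std)],
                    alpha=0.3, label="95% confidence band")
    ax.set_xlabel("Experiment number")
    ax.set_ylabel("Expected minimum value")
    ax.legend()
    return fig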

Pareto implementation causes issues when fitting a model without data

Using ProcessOptimizer 0.6.2 and scikit-learn 0.24.2:

The Pareto implementation causes an issue when asking for more points than n_initial_points (at least while the model has been built with no more than 0-1 data points).

Code to reproduce:
import matplotlib.pyplot as plt
import numpy as np
# import pandas as pd
import ProcessOptimizer as po
from ProcessOptimizer.plots import plot_objective, plot_evaluations
from ProcessOptimizer.utils import cook_estimator, create_result
from ProcessOptimizer.learning.gaussian_process.kernels import Matern, ConstantKernel
from ProcessOptimizer import expected_minimum

np.random.seed(42)

print(po.__version__)

space = [(-150., 50.),
         (25., 60.)]
space_names = ['CPU Offset Voltage', 'Turbo Boost long power max']

kernel_intern = 1**2 * Matern(length_scale=np.ones(len(space)),
                              length_scale_bounds=[0.1, 3.],
                              nu=2.5)
base_est = cook_estimator(base_estimator='GP',
                          space=space,
                          noise='gaussian',
                          kernel=kernel_intern,
                          normalize_y=True)

opt = po.Optimizer(dimensions=space,
                   base_estimator=base_est,
                   n_initial_points=5,  # After the initial 15 exp's, I plan to perform standard settings as exp#16
                   lhs=True,
                   acq_func='EI',
                   acq_func_kwargs={'xi': 0.01},
                   n_objectives=2)

print(opt.ask(10))

Expected result: 10 points without any error.
Actual result:
C:\Users\soren\Anaconda3\envs\PO20210522\lib\site-packages\ProcessOptimizer\optimizer\optimizer.py:1030: RuntimeWarning: invalid value encountered in true_divide
  return ArgMax, (MinDist-Mean)/(Std)

Normalize_y=False to avoid problem with n_initial_points being asked

As in scikit-optimize/scikit-optimize#947 ...
...and in scikit-learn/scikit-learn#18371

... we risk getting an error message that is difficult to understand and hence to fix. The problem relates to a division by zero when applying the constant liar method to ask for several experiments while the current knowledge is zero (or one) point(s).

I suggest that we simply change "normalize_y" to False (as the default value) in cook_estimator. The current default is:

normalize_y=True, noise="gaussian",
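
Until such a change of the default lands, the behaviour can already be chosen per call. A sketch reusing the call pattern from the Pareto reproduction code above (the space below is a placeholder):

from ProcessOptimizer.utils import cook_estimator

space = [(-150.0, 50.0), (25.0, 60.0)]
base_est = cook_estimator(base_estimator='GP', space=space,
                          noise='gaussian', normalize_y=False)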

Return of a result is influenced by multiobjective opt

During multiobjective optimization, a list of models is kept in ProcessOptimizer. This works for Pareto-type plotting, but "normal" ProcessOptimizer plotting fails.

Potential quick fix (not tested): change get_result (line 845 in optimizer.py) to return a free choice of model 1 or model 2 (given a two-objective optimization). Then instruct the user to create a "result" instance this way after experimentation and before plotting...

Expected minimum with categorical dimensions

Currently, the code can only call expected_minimum in cases without categorical dimensions. Yet, for plotting purposes, we do have code that estimates the expected minimum (also with categorical dimensions) by "simple" scattering of test points.
It would be nice to integrate that plotting code so it is also callable for normal use cases.
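
A hedged sketch of the "simple scattering" approach exposed as a plain function (the name is illustrative, not the library API): evaluate the last surrogate model on random samples from the space, which works regardless of categorical dimensions, and keep the best one.

import numpy as np

def expected_minimum_by_sampling(result, n_samples=10000, random_state=None):
    space = result.space
    points = space.rvs(n_samples=n_samples, random_state=random_state)
    predictions = result.models[-1].predict(space.transform(points))
    best = int(np.argmin(predictions))
    return points[best], predictions[best]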

Tests

A number of subtests in test_space.py are commented out. I think the "only" change needed is to remove the dtype flag in the calls.

Convergence metrics/plottings

We have a few plots to help the user decide whether to stop the experimentation or to continue the search for an even better minimum.
In plot_expected_minimum_convergence, we should likely remove the lower part of the plot or make it independent of input features of little importance.
Explanation: the lower part of the plot shows the Euclidean distance between the last and the current "recipe" for the expected_minimum. In cases where individual input features have little or no effect on the expected_minimum, a relatively large Euclidean distance between two equally well-performing minima will show up on the graph. This is unwanted behaviour.

Furthermore, let's sanity check that the Euclidean distance is calculated in transformed space.

Show model uncertainty on 2D plots

If we could "gray out" the expected value for uncertain results in the 2D plots for "plot_objective()", we could communicate the properties of the model better to the users.
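
One hedged way to do this (not the plot_objective implementation): overlay a semi-transparent gray layer whose opacity grows with the model's predictive standard deviation, so confident regions stay vivid and uncertain regions fade out. Here xi, yi, zi form a 2D partial dependence grid and std is the matching predictive standard deviation (hypothetical inputs).

import numpy as np

def overlay_uncertainty(ax, xi, yi, zi, std):
    cm = ax.contourf(xi, yi, zi)
    gray = np.zeros(zi.shape + (4,))
    gray[..., :3] = 0.5                                  # mid-gray RGB
    gray[..., 3] = np.clip(std / np.max(std), 0.0, 1.0)  # alpha grows with uncertainty
    ax.imshow(gray, extent=(np.min(xi), np.max(xi), np.min(yi), np.max(yi)),
              origin="lower", aspect="auto", zorder=2)
    return cm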

Expected_minimum with both continuous and categorical dimensions

For every combination of categorical dimensions, allow gradient-based minimization of the remaining continuous dimensions.

Test performance relative to expected_minimum_random

If it performs better, include it in the logical choice of expected-minimum functions.

user_choice = ["gradient", "random", "hybrid"]
Implement as: expected_minimum(method = user_choice)

A metric to indicate model sensitivity towards individual data points

In the real world, mistakes happen during experiments and measurements. This means that sometimes your data does not actually represent the settings you believe it does. In the domain of frequentist statistics (which we have left behind in order to use this tool), one would use things like normal-probability plots to detect outliers, but such a plot is not meaningful for a Bayesian model (as far as I know).

A different, possibly transferable metric related to each data point is DFFITS, which stands for "difference in fits". It is calculated by measuring the change in predicted values that occurs when that data point is deleted. The larger the value of DFFITS, the more that data point influences the fitted model. The metric has a concrete math-y definition that you can look up, but the point is that if the location of your expected minimum hinges on one data point, then you had better be sure that data point is valid (or gather more data before you stop your optimization process).
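
A hedged, unnormalized sketch of such a measure for the surrogate model: refit without point i and record how much the prediction at x_i changes. Here model is any fitted sklearn-style regressor (e.g. the GP surrogate), X and y are the observations in the model's (transformed) space, and the classical DFFITS normalization by the prediction's standard error is omitted.

import numpy as np
from sklearn.base import clone

def influence_per_point(model, X, y):
    X, y = np.asarray(X), np.asarray(y)
    influence = np.empty(len(X))
    for i in range(len(X)):
        mask = np.arange(len(X)) != i                       # leave point i out
        refit = clone(model).fit(X[mask], y[mask])
        influence[i] = model.predict(X[i:i + 1])[0] - refit.predict(X[i:i + 1])[0]
    return influence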

Whitekernel, noise bounds

When modelling very few or very noisy data points, there is a risk of "underfitting" the model by attributing all signal to noise. This is believed to be linked to the WhiteKernel that is added to the regressor kernel.
This might be solved by looking into the parameters set in the WhiteKernel. E.g. the noise bounds could be reduced from the current broad range of 1e-5 to 1e5 to something more sensible, although this might be problematic in cases where the y-axis is not normalized.

This might be part of a wider discussion around the overall choice of "Constant * Matern + White" as the default kernel.
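
A sketch of the kind of change being discussed, assuming WhiteKernel is exposed from the same module as Matern and ConstantKernel; the numbers are placeholders to illustrate narrower noise bounds, not a recommendation, and they presuppose a roughly normalized y-axis.

import numpy as np
from ProcessOptimizer.learning.gaussian_process.kernels import (
    ConstantKernel, Matern, WhiteKernel)

n_dims = 2  # placeholder number of dimensions
kernel = (ConstantKernel(1.0) * Matern(length_scale=np.ones(n_dims), nu=2.5)
          + WhiteKernel(noise_level=0.01, noise_level_bounds=(1e-4, 1e-1)))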

Bokeh_plot Bug

Changes in "dependence"-calculations have broken the bokeh plot. Quick-fix could be to catch the newly returned std as a dummy variable.

Unexpected LHS behavior

Hi,

import ProcessOptimizer as po

opt = po.Optimizer([('0.05', '0.1', '0.15', '0.2', '0.25', '0.3'), (1,2)], n_initial_points=12, lhs=True)
next_x = opt.ask(12)
print(next_x)

I would have expected to receive 12 different experiments (two complete sets with each of the categorical values represented).

Instead, I received:
[['0.25', 1], ['0.2', 1], ['0.3', 1], ['0.2', 2], ['0.3', 2], ['0.1', 1], ['0.15', 2], ['0.15', 1], ['0.05', 2], ['0.05', 2], ['0.25', 1], ['0.1', 2]]

As seen, there are categorical values being repeated in each of the sets (e.g. two ['0.05', 2] and two ['0.25', 1]).

LHS works as expected in this case:

import ProcessOptimizer as po
import numpy as np


for n in range(100):
    opt = po.Optimizer([(1,20)], n_initial_points=20, lhs=True)
    next_x = np.asarray(opt.ask(20)).reshape(1,-1)[0]
    print(f'Number of unique suggestions: {len(np.unique(next_x))}.')

Figures not displaying on PyPI page

The figures in the "How does it work" section on the ProcessOptimizer PyPI page are not displaying correctly.

A potential solution is to change the references to the figures in README.md to URLs instead of paths.

Feature request

How about going through the list of imported functions/classes in __init__.py? The purpose would be to prune the unused ones and add those that are actively being used.
In this way, we can avoid many lines of import-statements in our scripts and instead do something like:
import ProcessOptimizer as po
opt = po.Optimizer
...
po.plot_objective(result)

?

LHS sampling is different if limits are continuous or discrete

The points returned by the Latin hypercube sampling differ depending on whether the min/max limits are defined as integers or reals. The following code illustrates this:

from ProcessOptimizer import Optimizer
n_init = 5
space1 = [(0,100),(0,100),(0,100)]
space2 = [(0.0,100.0),(0.0,100.0),(0.0,100.0)]
opt1 = Optimizer(space1,n_initial_points=n_init)
opt2 = Optimizer(space2,n_initial_points=n_init)
opt1.ask(5)
opt2.ask(5)

opt1 returns:
[[25, 75, 0], [100, 50, 50], [50, 25, 25], [0, 0, 75], [75, 100, 100]]
while opt2 returns:
[[30.0, 70.0, 10.0], [90.0, 50.0, 50.0], [50.0, 30.0, 30.0], [10.0, 10.0, 70.0], [70.0, 90.0, 90.0]]

In my opinion, these lists should be the same within rounding. The difference is caused by different definitions of lhs_arange for the class Real and the class Integer in space.py. In the case of Real the points are spaced equally, with a half-step buffer to the limit values like this: a = (np.arange(n)+0.5)/n, while Integers do not include the buffer at the limits: rounded_numbers = np.round(np.linspace(self.low, self.high, n)).

I'll work on a fix for the integer class.
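
A hedged sketch of the intended fix: give Integer dimensions the same half-step buffer at the limits as Real dimensions, then round. The function name is illustrative, not the actual method in space.py.

import numpy as np

def lhs_arange_integer(low, high, n):
    fractions = (np.arange(n) + 0.5) / n                   # same spacing as the Real class
    return np.round(low + fractions * (high - low)).astype(int)

# For low=0, high=100, n=5 this gives [10, 30, 50, 70, 90],
# matching the Real case above within rounding.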

Implement Kriging believer for batch opt

According to the Abigail Doyle Nature paper, Kriging believer performed well for batch optimization (optimization in which we ask for n > 1 experiments per round). It could be fun to implement it and test it in a comparison between "constant liar", "Steinerberger", "random" and "Kriging believer".
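
A hedged sketch of the Kriging believer idea on top of the existing ask/tell API, where the constant liar's constant is replaced by the surrogate's predicted mean. It assumes a single objective, an optimizer that already has a fitted model (more than n_initial_points observations told), and that Optimizer.copy() behaves as in scikit-optimize; the function name is illustrative.

def ask_kriging_believer(opt, n_points):
    believer = opt.copy()            # work on a copy so the real optimizer is untouched
    points = []
    for _ in range(n_points):
        x = believer.ask()
        mu = believer.models[-1].predict(believer.space.transform([x]))[0]
        believer.tell(x, float(mu))  # "believe" the predicted mean as if it were observed
        points.append(x)
    return points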

opt.get_result() is not well described for multiobjective optimization

As the title says... In a multiobjective optimization, it is not intuitive (nor logical) that opt.get_result() only returns the model for the first objective.

I would propose a fix in which the code either just raises a warning that get_result is not implemented for multiobjective optimization, or returns a list of results. I would prefer the latter solution.

This also ties in with how we are generally handling the Optimizer instance during multiobjective optimization - like the connections with functions such as expected_minimum or plot_objective
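
A hedged sketch of the preferred (list of results) behaviour, assuming the multiobjective optimizer keeps one fitted model per objective in the last entry of its models attribute, as the traceback in a later issue suggests; the function name is illustrative.

from ProcessOptimizer.utils import create_result

def get_results_per_objective(opt):
    return [create_result(opt.Xi, [y[i] for y in opt.yi], opt.space, opt.rng,
                          models=[model])
            for i, model in enumerate(opt.models[-1])]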

Addition of Pareto plot with bokeh

An interactive plot for multi-objective optimizations directly exported as html code would be nice.

I propose to use the bokeh module for the task as this is already a dependency for ProcessOptimizer.
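
A minimal bokeh sketch of such an export; the Pareto front is assumed to be an iterable of (objective_1, objective_2) points from a two-objective optimization, and all names and labels are placeholders.

from bokeh.plotting import figure, output_file, save
from bokeh.models import HoverTool

def save_pareto_html(front, filename="pareto.html"):
    xs, ys = zip(*front)
    p = figure(title="Pareto front", x_axis_label="Objective 1",
               y_axis_label="Objective 2", tools="pan,wheel_zoom,reset,save")
    p.circle(xs, ys, size=8)
    p.add_tools(HoverTool(tooltips=[("objective 1", "@x"), ("objective 2", "@y")]))
    output_file(filename)
    save(p)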

Matplotlib Deprecation Warning in example notebook

The notebook that plots the Branin function, https://github.com/novonordisk-research/ProcessOptimizer/blob/develop/examples/visualizing-results.ipynb, gives a deprecation warning:
<ipython-input-15-9f6f868b015d>:10: MatplotlibDeprecationWarning: shading='flat' when X and Y have the same dimensions as C is deprecated since 3.3. Either specify the corners of the quadrilaterals with X and Y, or pass shading='auto', 'nearest' or 'gouraud', or set rcParams['pcolor.shading']. This will become an error two minor releases later.
  cm = ax.pcolormesh(x_ax, y_ax, fx,
Passing shading='auto' resolves it, as in:
cm = ax.pcolormesh(x_ax, y_ax, fx, shading='auto', norm=LogNorm(vmin=fx.min(), vmax=fx.max()))

Fix issues in test suite caused by EIps and PIps

The introduction of multiobjective optimization (#1) led to issues with the test suite, specifically those relating to EIps and PIps. As neither of these acquisition functions is relevant to the optimization of real-world processes, we suggest removing the tests relating to them. This should probably be followed by an issue aimed at also removing the actual acquisition functions, to avoid confusion later.

opt.ask() multiobjective provides error when passing n_points > 1

ProcessOptimizer version 0.6.0

space = [(5.0, 20.0), (0.5, 2.0), (10.0, 50.0), (5.0, 20.0),
         (3.0, 5.0), (4.0, 10.0), (6.0, 16.0), (5.7, 6.7),
         (25.0, 37.0)]

opt = Optimizer(space, n_initial_points=5, base_estimator='GP',
                acq_func='EI', acq_func_kwargs={"xi": 0.1}, n_objectives=2)

""" must opt.tell data more than n_initial_points """

print(opt.ask(n_points=2))

ERROR:

Traceback (most recent call last):

File "", line 27, in
print(opt.ask(n_points=2))

File "C:\Users\roht\AppData\Local\Continuum\anaconda3\lib\site-packages\ProcessOptimizer\optimizer\optimizer.py", line 441, in ask
np.iinfo(np.int32).max))

File "C:\Users\roht\AppData\Local\Continuum\anaconda3\lib\site-packages\ProcessOptimizer\optimizer\optimizer.py", line 324, in copy
optimizer._tell(self.Xi, self.yi)

File "C:\Users\roht\AppData\Local\Continuum\anaconda3\lib\site-packages\ProcessOptimizer\optimizer\optimizer.py", line 651, in _tell
pop, logbook, front = self.NSGAII()

File "C:\Users\roht\AppData\Local\Continuum\anaconda3\lib\site-packages\ProcessOptimizer\optimizer\optimizer.py", line 1042, in NSGAII
MU=MU)

File "C:\Users\roht\AppData\Local\Continuum\anaconda3\lib\site-packages\ProcessOptimizer\optimizer_NSGA2.py", line 72, in NSGAII
for ind, fit in zip(invalid_ind, fitnesses):

File "C:\Users\roht\AppData\Local\Continuum\anaconda3\lib\site-packages\ProcessOptimizer\optimizer\optimizer.py", line 985, in _ObjectiveGP
F[i] = self.models[(len(self.yi)- self.n_initial_points
)][i].predict(xx)[0]

IndexError: list index out of range
