
Bayesian Optimization and Design of Experiments

Home Page: https://emdgroup.github.io/baybe/

License: Apache License 2.0

bayesian-optimization design-of-experiments machine-learning active-learning experimental-design optimization


  Homepage  •  User Guide  •  Documentation  •  Contribute  

BayBE — A Bayesian Back End for Design of Experiments

The Bayesian Back End (BayBE) provides a general-purpose toolbox for Bayesian Design of Experiments, focusing on additions that enable real-world experimental campaigns.

Besides functionality to perform a typical recommend-measure loop, BayBE's highlights are:

  • ✨ Custom parameter encodings: Improve your campaign with domain knowledge
  • 🧪 Built-in chemical encodings: Improve your campaign with chemical knowledge
  • 🎯 Single and multiple targets with min, max and match objectives
  • ⚙️ Custom surrogate models: For specialized problems or active learning
  • 🎭 Hybrid (mixed continuous and discrete) spaces
  • 🚀 Transfer learning: Mix data from multiple campaigns and accelerate optimization
  • 📈 Comprehensive backtesting, simulation and imputation utilities: Benchmark and find your best settings
  • 📝 Fully typed and hypothesis-tested: Robust code base
  • 🔄 All objects are fully de-/serializable: Useful for storing results in databases or for use in wrappers such as APIs

Quick Start

Let us consider a simple experiment where we control three parameters and want to maximize a single target called Yield.

First, install BayBE into your Python environment:

pip install baybe 

For more information on this step, see our detailed installation instructions.

Defining the Optimization Objective

In BayBE's language, the Yield can be represented as a NumericalTarget, which we wrap into a SingleTargetObjective:

from baybe.targets import NumericalTarget
from baybe.objectives import SingleTargetObjective

target = NumericalTarget(
    name="Yield",
    mode="MAX",
)
objective = SingleTargetObjective(target=target)

In cases where we are confronted with multiple (potentially conflicting) targets, the DesirabilityObjective can be used instead. It allows defining additional settings, such as how these targets should be balanced. For more details, see the objective section of the user guide.
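Conceptually, such a desirability approach scalarizes the normalized targets into a single score. The following plain-Python sketch illustrates the idea with a weighted geometric mean; the function name, min-max normalization, and scalarizer choice are illustrative assumptions, not BayBE's actual implementation:

```python
from math import prod

def desirability(values, bounds, weights):
    """Weighted geometric mean of min-max-normalized target values.

    values:  measured target values
    bounds:  (lower, upper) tuple per target, used for normalization
    weights: relative importance of each target (need not sum to 1)
    """
    total = sum(weights)
    normalized = [(v - lo) / (hi - lo) for v, (lo, hi) in zip(values, bounds)]
    # geometric mean: product of d_i ** (w_i / sum(w))
    return prod(d ** (w / total) for d, w in zip(normalized, weights))

score = desirability(
    values=[80.0, 0.5],
    bounds=[(0.0, 100.0), (0.0, 1.0)],
    weights=[2.0, 1.0],  # the first target counts twice as much
)
```

A geometric mean (rather than an arithmetic one) drives the combined score toward zero whenever any single target is poor, which is often what "balancing" conflicting targets should mean.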

Defining the Search Space

Next, we inform BayBE about the available "control knobs", that is, the underlying system parameters we can tune to optimize our targets. This also involves specifying their values/ranges and other parameter-specific details.

For our example, we assume that we can control three parameters – Granularity, Pressure[bar], and Solvent – as follows:

from baybe.parameters import (
    CategoricalParameter,
    NumericalDiscreteParameter,
    SubstanceParameter,
)

parameters = [
    CategoricalParameter(
        name="Granularity",
        values=["coarse", "medium", "fine"],
        encoding="OHE",  # one-hot encoding of categories
    ),
    NumericalDiscreteParameter(
        name="Pressure[bar]",
        values=[1, 5, 10],
        tolerance=0.2,  # allows experimental inaccuracies up to 0.2 when reading values
    ),
    SubstanceParameter(
        name="Solvent",
        data={
            "Solvent A": "COC",
            "Solvent B": "CCC",  # label-SMILES pairs
            "Solvent C": "O",
            "Solvent D": "CS(=O)C",
        },
        encoding="MORDRED",  # chemical encoding via mordred package
    ),
]

For more parameter types and their details, see the parameters section of the user guide.

Additionally, we can define a set of constraints to further specify allowed ranges and relationships between our parameters. Details can be found in the constraints section of the user guide. In this example, we assume no further constraints.

With the parameter and constraint definitions at hand, we can now create our SearchSpace based on the Cartesian product of all possible parameter values:

from baybe.searchspace import SearchSpace

searchspace = SearchSpace.from_product(parameters)
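Conceptually, the product construction enumerates every combination of the discrete parameter values; a stdlib sketch of the same idea (the actual SearchSpace class builds a richer computational representation on top of this):

```python
from itertools import product

granularity = ["coarse", "medium", "fine"]
pressure = [1, 5, 10]
solvent = ["Solvent A", "Solvent B", "Solvent C", "Solvent D"]

# Cartesian product: every (granularity, pressure, solvent) combination
candidates = list(product(granularity, pressure, solvent))
print(len(candidates))  # 3 * 3 * 4 = 36 combinations
```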

Optional: Defining the Optimization Strategy

As an optional step, we can specify details on how the optimization should be conducted. If omitted, BayBE will choose a default setting.

For our example, we combine two recommenders via a so-called meta recommender named TwoPhaseMetaRecommender:

  1. In cases where no measurements have been made prior to the interaction with BayBE, a selection via initial_recommender is used.
  2. As soon as the first measurements are available, we switch to recommender.

For more details on the different recommenders, their underlying algorithmic details, and their configuration settings, see the recommenders section of the user guide.

from baybe.recommenders import (
    SequentialGreedyRecommender,
    FPSRecommender,
    TwoPhaseMetaRecommender,
)

recommender = TwoPhaseMetaRecommender(
    initial_recommender=FPSRecommender(),  # farthest point sampling
    recommender=SequentialGreedyRecommender(),  # Bayesian model-based optimization
)
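As a rough illustration of what farthest point sampling does for the initial batch, here is a minimal plain-Python sketch (illustrative only, not BayBE's implementation):

```python
def farthest_point_sampling(points, n):
    """Greedily pick n points that are mutually far apart (Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    selected = [points[0]]  # start from an arbitrary point
    while len(selected) < n:
        # pick the candidate whose nearest already-selected point is farthest away
        best = max(
            (p for p in points if p not in selected),
            key=lambda p: min(dist(p, s) for s in selected),
        )
        selected.append(best)
    return selected

batch = farthest_point_sampling([(0, 0), (1, 0), (0, 1), (5, 5)], 2)
print(batch)  # [(0, 0), (5, 5)]
```

Spreading the initial measurements out like this gives the subsequent model-based recommender informative coverage of the space before any model exists.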

The Optimization Loop

We can now construct a campaign object that brings all pieces of the puzzle together:

from baybe import Campaign

campaign = Campaign(searchspace, objective, recommender)

With this object at hand, we can start our experimentation cycle. In particular:

  • We can ask BayBE to recommend new experiments.
  • We can add_measurements for certain experimental settings to the campaign's database.

Note that these two steps can be performed in any order. In particular, available measurements can be submitted at any time and also several times before querying the next recommendations.

df = campaign.recommend(batch_size=3)
print(df)
   Granularity  Pressure[bar]    Solvent
15      medium            1.0  Solvent D
10      coarse           10.0  Solvent C
29        fine            5.0  Solvent B

Note that the specific recommendations will depend on both the data already fed to the campaign and the random number generator seed that is used.

After having conducted the corresponding experiments, we can add our measured targets to the table and feed it back to the campaign:

df["Yield"] = [79.8, 54.1, 59.4]
campaign.add_measurements(df)

With the newly arrived data, BayBE can produce a refined design for the next iteration. This loop would typically continue until a desired target value has been achieved in the experiment.
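Abstractly, the whole loop can be sketched with the experiment mocked by a toy yield function and the recommender by random sampling (purely illustrative; all names here are made up and nothing below is BayBE code):

```python
import random

def run_experiment(pressure):
    """Stand-in for the real lab measurement (hypothetical yield curve)."""
    return 100 - (pressure - 5) ** 2

rng = random.Random(0)
candidates = [1, 3, 5, 7, 10]
measurements = {}

for _ in range(5):                      # the recommend-measure loop
    pressure = rng.choice(candidates)   # plays the role of recommend(...)
    measurements[pressure] = run_experiment(pressure)  # ...and add_measurements(...)

best = max(measurements, key=measurements.get)
```

A Bayesian recommender differs from this sketch only in the selection step: instead of choosing at random, it fits a model to `measurements` and picks the candidates with the highest acquisition value.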

Advanced Example - Chemical Substances

BayBE has several modules to go beyond traditional approaches. One such example is the use of custom encodings for categorical parameters. Chemical encodings for substances are a special built-in case of this that comes with BayBE.

In the following picture you can see the outcome for treating the solvent, base and ligand in a direct arylation reaction optimization (from Shields, B.J. et al.) with chemical encodings compared to one-hot and a random baseline: Substance Encoding Example


Installation

From Package Index

The easiest way to install BayBE is via PyPI:

pip install baybe

A specific released version of the package can be installed by specifying the corresponding version tag in the form baybe==x.y.z.

From GitHub

If you need finer control and would like to install a specific commit that has not been released under a version tag, you can install BayBE directly from GitHub via git by specifying the corresponding git ref.

For instance, to install the latest commit of the main branch, run:

pip install git+https://github.com/emdgroup/baybe.git@main

From Local Clone

Alternatively, you can install the package from your own local copy. First, clone the repository, navigate to the repository root folder, check out the desired commit, and run:

pip install .

A developer would typically also install the package in editable mode ('-e'), which ensures that changes to the code do not require a reinstallation.

pip install -e .

If you need to add additional dependencies, make sure to use the correct syntax, including quotation marks:

pip install -e '.[dev]'

Optional Dependencies

There are several dependency groups that can be selected during pip installation, like

pip install 'baybe[test,lint]' # will install baybe with additional dependency groups `test` and `lint`

To get the most out of BayBE, we recommend installing at least

pip install 'baybe[chem,simulation]'

The available groups are:

  • chem: Cheminformatics utilities (e.g. for the SubstanceParameter).
  • docs: Required for creating the documentation.
  • examples: Required for running the examples/streamlit.
  • lint: Required for linting and formatting.
  • mypy: Required for static type checking.
  • onnx: Required for using custom surrogate models in ONNX format.
  • simulation: Enabling the simulation module.
  • test: Required for running the tests.
  • dev: All of the above plus tox and pip-audit. For code contributors.

Telemetry

BayBE collects anonymous usage statistics only for employees of Merck KGaA, Darmstadt, Germany and/or its affiliates. The recording of metrics is turned off for all other users and is impossible due to a VPN block. In any case, the usage statistics do not involve logging of recorded measurements, targets/parameters or their names or any project information that would allow for reconstruction of details. The user and host machine names are irreversibly anonymized.

  • You can verify the above statements by studying the open-source code in the telemetry module.
  • You can always deactivate all telemetry by setting the environment variable BAYBE_TELEMETRY_ENABLED to false or off. For details please consult this page.

Authors

  • Martin Fitzner (Merck KGaA, Darmstadt, Germany), Contact, Github
  • Adrian Šošić (Merck Life Science KGaA, Darmstadt, Germany), Contact, Github
  • Alexander Hopp (Merck KGaA, Darmstadt, Germany), Contact, Github
  • Alex Lee (EMD Electronics, Tempe, Arizona, USA), Contact, Github

Known Issues

A list of known issues can be found here.

License

Copyright 2022-2024 Merck KGaA, Darmstadt, Germany and/or its affiliates. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

baybe's People

Contributors

adriansosic, avhopp, brandon-holt, danielweber90, galaxee87, nicornk, rimrihana, rjavadi, scienfitz, sgbaird, tobiasploetz

baybe's Issues

Simulation bug in `ignore` mode

Hello! I get an error when trying to run simulate_experiment. The code errors here in simulation.py

        searchspace = campaign.searchspace.discrete.exp_rep
        missing_inds = searchspace.index[
            searchspace.merge(lookup, how="left", indicator=True)["_merge"]
            == "left_only"
        ]
        campaign.searchspace.discrete.metadata.loc[
            missing_inds, "dont_recommend"
        ] = True

The error is IndexError: boolean index did not match indexed array along dimension 0; dimension is 131220 but corresponding boolean dimension is 131227

I'm a bit confused, since the error happens because the lengths of missing_inds and campaign.searchspace.discrete.metadata don't match, yet both are derived from campaign.searchspace.discrete?

What might be going on here?

Minor visual issues in the documentation

This issue is intended to serve as a place where minor visual issues regarding the documentation can be collected. Once enough of these have been identified, they will be resolved in a corresponding PR.

Note in particular that this issue should not be used to collect major changes. Basically, if you see something in our documentation and think "Huh, that just looks a bit weird and would benefit from reformatting", then this is the place to just put it 😄

Expose underlying model of campaign

Hi, is it/would it be possible to expose the underlying model of a campaign in order to calculate predicted means and variances of a set of new measurements (not necessarily those recommended by the campaign, but any user-specified measurement that exists within the search space)?

Ideally I'd like to be able to quantify the performance of the model on a set of known measurements as well.

Thanks!

Installation with Poetry fails with "Package 'baybe[telemetry]' is listed as a dependency of itself."

When trying to install baybe in an env with Poetry, it fails with the error Package 'baybe[telemetry]' is listed as a dependency of itself.

Steps to reproduce on MacOS:

poetry --version
> Poetry (version 1.8.2)
poetry init
# create a dummy package with Poetry, Python version ~3.11.0
cat pyproject.toml
> ...
> [tool.poetry.dependencies]
> python = "~3.11.0"
> 
> 
> [build-system]
> requires = ["poetry-core"]
> build-backend = "poetry.core.masonry.api"
> ...

poetry add baybe                                                               

> Using version ^0.8.1 for baybe
> 
> Updating dependencies
> Resolving dependencies... (0.3s)
> 
> Package 'baybe[telemetry]' is listed as a dependency of itself.

Python 3.12 blocked by failing config checks

The tests that ensure an invalid config is rejected fail in Python 3.12 (but not in other versions).

This seems related to cattrs.

Also, it is possible to write a config with recomEnDer (spelling mistake) and it will fall back to a default instead of throwing an error.

PR for Python 3.12 upgrade: #153

Re-writing Campaign User Guide

This issue is created to make everybody (specifically @Scienfitz and @AdrianSosic) aware of me starting to re-write the user guide for campaigns.
The corresponding discussion will be closed, and all comments regarding this issue that I should be aware of before opening the PR can be posted here.

Cannot assign the following values containing duplicates to parameter X

Thanks for providing such a cool abstraction for designing experiments using powerful Bayesian methods.

I have a set of experimental data points, each containing various parameters. I am not sure whether it is by design, but in experiments it is natural for a parameter to hold the same value across multiple runs (duplicates). However, when defining the parameters as CategoricalParameter, NumericalDiscreteParameter, or NumericalContinuousParameter, I get the following traceback, shown here for one of the parameters:
ValueError: Cannot assign the following values containing duplicates to parameter FeatureName: (1, 2, 3, 4, 5, 6, 1, 2, 3, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6).

Estimate shape of search space?

Is there a way to calculate an upper bound for the shape of the search space? I'm trying to find a way to give the user feedback on the complexity of the search space and also how much memory needs to be allocated to proceed with the campaign and generate experiments.
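For discrete parameters, a simple upper bound before constraints is the product of the per-parameter value counts; a back-of-the-envelope sketch (the parameter names, encoded-column count, and byte size are assumptions for illustration):

```python
from math import prod

# hypothetical discrete parameters and their value counts
n_values = {"Granularity": 3, "Pressure[bar]": 3, "Solvent": 4}

upper_bound = prod(n_values.values())   # at most 36 candidate rows
print(upper_bound)

# rough memory estimate: rows * encoded columns * 8 bytes per float64;
# the encoded column count depends on the chosen parameter encodings
n_encoded_columns = 10
est_bytes = upper_bound * n_encoded_columns * 8
```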

Error in match mode when trying to take mean of bounds

Hi,

When running a simulation in match mode, the code fails when trying to generate results because it attempts to take np.mean() of an Interval, raising TypeError: unsupported operand type(s) for: 'Interval' and 'int'

The error occurs on line 551 of simulation.py here

elif target.mode is TargetMode.MATCH:
    match_val = np.mean(target.bounds)

In the meantime, I am just changing the line locally to

match_val = np.mean([target.bounds.lower, target.bounds.upper])

Assuming this is just from outdated code since you mentioned this was an old module in need of a refresh!

User guide: Constraints

The user guide for the constraints might need an overhaul but first needs to be checked more thoroughly.

Batch_size error

Hi,

I am getting the error "TypeError: Campaign.recommend() got an unexpected keyword argument 'batch_size'". I am attaching a snapshot for your reference. Could you please help me sort this error out? It was working fine until yesterday.

[screenshot of the error]

Access to non-persistent data such as acqf functions and the overall model

Thank you for the wonderful tool! I have a question about the architecture. Is there a way to add already observed experimental data to a campaign in experiments? Also, is there a way to check the average and variance of the recommended experimental points and exploration space, as well as the evaluation values of the acquisition function in a dataframe?

Return incomplete results when simulation errors out

Hi! I was wondering if it would be possible to add a feature where simulations will still return the results compiled up to the point of an error?

The situation I'm running into when running on larger datasets is a botorch error to the tune of
All attempts to fit the model have failed.

I am in the process of troubleshooting what about the dataset is causing the failure, but in the meantime it would be nice to see the results up to that point, which should include dozens of batches of experiments.

Also, if you have any experience with what might be causing an error like this, that would be helpful!

Referring to this comment in a botorch thread: pytorch/botorch#1226 (comment)
I initially wondered if this could be my issue, but baybe should prevent this from being an issue since it identifies duplicate parameter values and randomly picks one.

Handling of Infinity in serialization

When de-serializing Campaign objects, we sometimes have to deal with inf values (for example, as boundaries to the objective target). When these values are serialized to JSON, they are replaced by Infinity objects. Infinity objects are actually Javascript objects, but are not supported by JSON specifications (https://web.archive.org/web/20160414190115/http://tools.ietf.org/html/rfc4627).

Python's JSON library has no problem transforming back and forth between JSON and dict objects when Infinity is present. However, some other solutions that utilize JSON (for example, MongoDB) do not understand Infinity and cast the value to null. There could be potential problems for other downstream consumers that strictly adhere to the JSON specifications.

Here are some resources discussing the issue and potential solutions.:
https://medium.com/the-magic-pantry/infinity-and-json-cde6df62c17c
https://stackoverflow.com/questions/1423081/json-left-out-infinity-and-nan-json-status-in-ecmascript

General feedback - documentation, functionality, and software dev

I had a nice time reviewing this repository! Overall I think it's a really comprehensive, clean, and well-documented project. Thank you for open-sourcing it! Find below some questions and suggestions:

Docs improvements

General

  • It has a nice, clean look, both on the GitHub README and on the documentation site. This is really important I think.

README

  • A Colab notebook or similar would be really good I think. See e.g., https://colab.research.google.com/drive/1VEHXBLVkn5NZ7N-Oj6-dc_hkIfwFcUE-?usp=sharing. I needed to %pip install 'baybe[chem,simulation]' numpy==1.24.4 on Colab, otherwise it seems to work OK (see numpy/numpy#25150 (comment)). Consider moving "Quick Start" into a tutorial notebook and provide a Colab link. It looks like you already have it set up to convert Jupyter notebooks into html pages (e.g., https://emdgroup.github.io/baybe/examples/Constraints_Discrete/mixture_constraints.html)

  • It appears that the README example is doing only a single iteration. I would have expected to see an optimization loop and some information about best parameters, though I get that this is geared more towards wetlab scientists.

  • Maybe clarify in the README that people can choose from different scaling methods with a link to the docs? I eventually happened upon that part of the docs.

  • I think the detailed installation information could go into an "Advanced Installation" section (either a separate README that gets incorporated into the docs, or near the end of the README). Within the main README in a "quick installation" section, then include a link to the advanced installation instructions. See #95

  • The README example doesn't have much by way of outputs (e.g., print statements and expected output). See #95

  • Same for visual representation, such as an optimization trace using BayBE on a task. Are there any built-in visualization methods? If not, consider including at least some examples of visualizing performance

Webpage

Terminology

Backtesting

I don't think "Backtesting" is common terminology for chem/materials informatics communities, at least in North America. It seems to be more common in finance, for example: https://en.wikipedia.org/wiki/Backtesting. When I wandered into https://emdgroup.github.io/baybe/_autosummary/baybe.simulation.html#module-baybe.simulation, I finally realized that what you refer to as simulation and backtesting is what I would typically refer to as benchmarking. I was thinking that maybe you implemented multi-task BO, where you could leverage physics-based simulations to help inform wetlab/experimental search campaigns. It took a while before this became clear to me.

Transfer learning

Right now, "Transfer learning: Mix data from multiple campaigns and accelerate optimization" is mentioned on https://emdgroup.github.io/baybe/misc/readme_link.html#, but it doesn't seem like this is really implemented yet, other than https://emdgroup.github.io/baybe/_autosummary/baybe.simulation.simulate_transfer_learning.html#baybe.simulation.simulate_transfer_learning. However, it doesn't appear to me that transfer learning is being used here. Even going through the function (https://emdgroup.github.io/baybe/_modules/baybe/simulation.html#simulate_transfer_learning), it was a bit tough to realize what was happening until I looked up TaskParameter. Suddenly, it made sense to me that what you're referring to as a task parameter is what I refer to as a contextual variable. This is also really good for me to see that contextual variable optimization is supported. However, I don't really consider this as transfer learning. In my mind, transfer learning means using one model to inform another. In contextual Bayesian optimization, certain variables are being fixed at each prediction. Perhaps I misunderstood something though. I imagine this will become clearer once https://emdgroup.github.io/baybe/userguide/transfer_learning.html has been developed.

Functionality

Multi-objective

It seems that Expected Hypervolume Improvement (EHVI) isn't one of the supported options for multi-objective optimization. Could you comment on this? With the DESIRABILITY mode, are each of the targets modeled independently prior to scalarization? If not, I tend to have a hard time referring to something like this as multi-objective optimization. In my mind, it's single-objective optimization of a fixed scalarization of several objectives. As alluded to in https://emdgroup.github.io/baybe/userguide/objective.html#desirability, it's good that a clarification is made about the scales being combined.

Batch conditioning

Do you perform conditioning on your batches (i.e., compute a joint acquisition function value)? For example, using fantasy point modeling. This is one of the easiest "gotchas" of batch optimization. See facebook/Ax#778 (comment) and https://youtu.be/JzgkSR6FFyM?si=dzv3RVvjKrZlkjlH

Comparison to other packages

What needs does BayBE fulfill that other packages don't? I think the README should clarify what makes BayBE stand apart from others and reference these other packages, too. For example, there's Ax (https://ax.dev), Gauche (https://github.com/leojklarner/gauche), Atlas (https://github.com/aspuru-guzik-group/atlas), Olympus, and https://github.com/experimental-design/bofire.

For example:

  • Ax is a general-purpose tool, also built on BoTorch, but has to be retrofitted in many cases to support wetlab experiment setups
  • Atlas is a nice framework, also based on BoTorch, but is not as well-maintained (single developer, non-recent commits - Riley is pretty busy)
  • Olympus is a nice benchmarking framework for Bayesian optimization for chemistry and materials science and supports comparison of many algorithms across many datasets; however, it has the same issues as with Atlas. Also, it isn't as straightforward to apply this to custom datasets, and it's limited to single-objective optimization without categorical parameters, if I'm not mistaken.
  • BoFire is also developed by chemistry/materials-oriented folks. It has been evolving, and is in a decently polished, though minimal state now. I don't think it natively supports chemical encodings, but it supports a number of other things, especially in terms of constraints (e.g., NChooseK constraints: https://experimental-design.github.io/bofire/nchoosek_constraint/)
  • Gauche is the most similar to BayBE in my mind. This is one where I suggest looking closely and considering similarities and differences. One of these differences could be in the vision/roadmap you have for BayBE, which may not be the same roadmap Gauche is intending.

I keep what is probably an overly inclusive list of GitHub repos at https://github.com/stars/sgbaird/lists/optimization-and-tuning and a shortlist at https://github.com/AccelerationConsortium/awesome-self-driving-labs/blob/main/readme.md#optimization. I added BayBE to these lists recently.

I'm also interested to see an optimization comparison/benchmark of using the Mordred encoding with the solvent vs. treating it as a purely categorical variable.

Software development

  • I can appreciate that BayBE seems well-maintained from a software developer perspective! This is welcome in the fields of chemistry and materials science, which understandably often lacks this.

  • There are a lot of dependencies. I'm glad you split them up into groups!

  • I notice you have a lot of >= dependencies in https://github.com/emdgroup/baybe/blob/main/pyproject.toml. Is this overly restrictive? It's OK if you don't think so.

  • The docstrings look really nice, and it's nice to have the function cross-linking across the API docs.

  • I look forward to seeing how you use hypothesis testing here!

Feel free to convert to a discussion if desired, and happy to refactor into multiple items if that would be better.

Telemetry code executes and throws in container, even with telemetry disabled

Hi team,

I'm running baybe in a container under a non-root user. I'm setting BAYBE_TELEMETRY_ENABLED=false. However, some telemetry code still seems to be executed.

File "/workdir/.venv/lib/python3.12/site-packages/baybe/telemetry.py", line 109, in <module>
    hashlib.sha256(getpass.getuser().upper().encode()).hexdigest().upper()[:10]
                   ^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/getpass.py", line 169, in getuser
    return pwd.getpwuid(os.getuid())[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'getpwuid(): uid not found: 1000'

As a workaround, I will create a user in the container so that this call doesn't throw an exception. However, I would expect that no code is executed as part of loading the baybe module when telemetry is disabled.
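A defensive pattern for such a lookup is to fall back to a placeholder when the uid has no passwd entry (a sketch, not the actual telemetry code; the function name is made up):

```python
import getpass
import hashlib

def anonymized_user() -> str:
    """Hashed username with a fallback for environments without a passwd entry."""
    try:
        user = getpass.getuser()
    except (KeyError, OSError):  # e.g. containers running under an unmapped uid
        user = "UNKNOWN"
    return hashlib.sha256(user.upper().encode()).hexdigest().upper()[:10]

print(anonymized_user())
```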

In general, the telemetry code seems pretty heavy (making suspicious syscalls to retrieve hostname, uid, tries to make an HTTP request which will likely raise some flags when scanning code bases for malicious behavior or backdoors).

I'd also like to mention that the claim that this hash is irreversible and cannot identify the user or their machine is not necessarily true. You can easily pre-compute a rainbow table of the hashes for all valid usernames and then do a reverse lookup.
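The reverse lookup is easy to demonstrate: since usernames come from a small space, the hashes can be precomputed (illustrative sketch; the hashing scheme mirrors the truncated SHA-256 shown in the traceback above):

```python
import hashlib

def user_hash(name: str) -> str:
    """Truncated SHA-256 of the upper-cased username."""
    return hashlib.sha256(name.upper().encode()).hexdigest().upper()[:10]

# precompute hashes for candidate usernames (a "rainbow table")...
rainbow = {user_hash(name): name for name in ["alice", "bob", "mallory"]}

# ...then any observed hash of a name in the table is trivially reversed
observed = user_hash("bob")
print(rainbow.get(observed))  # bob
```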

User guide: Parameters

The user guide for the parameters might need an overhaul but first needs to be checked more thoroughly.

Recommendations taking a long time

I'm wondering if the recommendation times I'm encountering are expected given my setup:

Machine:
MacBook Air, 15 inch, M2, 2023
Memory: 16 GB
OS: Sonoma 14.4.1
Python: 3.11.8

Model:
Single NumericalTarget
Parameters: 4 SubstanceParameters (~140 total SMILES molecules), 4 NumericalContinuousParameters
Constraints: 4 numerical parameters must sum to 1.0
Recommender: TwoPhaseMetaRecommender(
initial_recommender=RandomRecommender(),
recommender=SequentialGreedyRecommender())

So when I add 1000 datapoints via campaign.add_measurements(), it takes ~4 days to make a recommendation with a batch size of 3. I started a test with only 10 datapoints and it has been running since overnight.

Does this sound expected given my machine, model, and data? If so, what would be the recommended ways to improve the speed? For the molecules I've tried with and without mordred & decorrelation, doesn't seem to make a big difference.

If this doesn't sound expected, how would you recommend I troubleshoot what could be causing the issue?

Thanks in advance!

Random seed being set somewhere hidden inside baybe?

Hi!

I was wondering if there is anywhere hidden inside baybe that could be locking a random seed somewhere inadvertently?

I am seeing that in a sequence where I call the baybe simulate_experiment method in a loop, I sometimes get the exact same results on multiple iterations, and this carries over to a different method (defined by me) which also relies on calls to np.random. I'm noticing that when I call simulate_experiment in the same loop as my method, my method also returns identical results, but when I comment out simulate_experiment, my method goes back to returning random results.

I also tried manually setting the random_seed input to an iteration integer every time of the loop, but this still happens.

I can work on putting together a repro of this, but I just wanted to go ahead and put this out there to see if there's anywhere obvious this might be happening.
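For reference, the symptom matches a library seeding the global RNG as a side effect. The stdlib analogue, and the usual fix of using a dedicated generator instance, look like this (illustrative only, not baybe's actual code):

```python
import random

def library_call():
    random.seed(1337)   # a library seeding the GLOBAL rng as a side effect

random.seed()           # caller wants unpredictable numbers
library_call()
a = random.random()
library_call()
b = random.random()
print(a == b)           # True: the caller's "random" numbers now repeat

# fix: use a dedicated generator instance with isolated state
rng = random.Random()   # unaffected by random.seed() elsewhere
library_call()
c = rng.random()
library_call()
d = rng.random()
print(c == d)           # False (with overwhelming probability)
```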

Best way to represent a feature that is a variable-length vector of integers

Hi,

I am wondering what is the best way to encode a feature that is a variable-length vector of integers. Would this work out of the box?

If not, the way I would naturally do this is to find the datapoint with the longest vector for this feature, let's call this length N. Then I would define N features, where each element in the vector is a different feature. For datapoints with len(vector) < N, all features between len(vector) and N would be assigned a value of 0.

This seems not ideal though, since the number and identity of the features depend on the data, which will change over time.

Is there a better way to do this, or is specific accommodation of this on the roadmap anywhere? Thanks!!
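For what it's worth, the padding scheme described above can be written generically; fixing N up front rather than deriving it from the data avoids the feature set shifting over time (a sketch; the names are hypothetical):

```python
def pad_vector(vec, n, fill=0):
    """Pad a variable-length vector of ints to a fixed length n."""
    if len(vec) > n:
        raise ValueError(f"vector longer than n={n}: {vec}")
    return list(vec) + [fill] * (n - len(vec))

# expand into n fixed features, e.g. vec_0 ... vec_4
N = 5
row = {f"vec_{i}": v for i, v in enumerate(pad_vector([3, 1, 4], N))}
print(row)  # {'vec_0': 3, 'vec_1': 1, 'vec_2': 4, 'vec_3': 0, 'vec_4': 0}
```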

Cannot import 'get_canonical_smiles' from 'baybe.utils'

I am trying to run BayBE 0.7.2 with Python 3.11 on a Windows machine.
I am trying out the example script from basic serialization (examples/Serialization/basic_serialization.py).
When trying to import Campaign, I get the following:


  Cell In[2], line 1
    from baybe import Campaign

  File ~\code\auto_flow\.venv\Lib\site-packages\baybe\__init__.py:5
    from baybe.campaign import Campaign

  File ~\code\auto_flow\.venv\Lib\site-packages\baybe\campaign.py:13
    from baybe.parameters.base import Parameter

  File ~\code\auto_flow\.venv\Lib\site-packages\baybe\parameters\__init__.py:14
    from baybe.parameters.substance import SubstanceParameter

  File ~\code\auto_flow\.venv\Lib\site-packages\baybe\parameters\substance.py:13
    from baybe.utils import (

ImportError: cannot import name 'get_canonical_smiles' from 'baybe.utils' (C:\Users\M316675\code\auto_flow\.venv\Lib\site-packages\baybe\utils\__init__.py)

I had a look, and it seems the problem is that substance.py explicitly imports get_canonical_smiles. However, in the ./utils/chemistry.py module, get_canonical_smiles is only defined if rdkit is installed.
Since I don't have rdkit installed, the import of get_canonical_smiles in substance.py fails.

I would recommend including rdkit in the dependencies for this project.
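As a sketch of one alternative fix (not baybe's actual code), the failure could also be deferred so that it only surfaces when the chemistry functionality is actually used, keeping rdkit an optional dependency:

```python
try:
    from rdkit import Chem
    _RDKIT_AVAILABLE = True
except ImportError:
    _RDKIT_AVAILABLE = False

def get_canonical_smiles(smiles: str) -> str:
    """Return the canonical form of a SMILES string (requires rdkit)."""
    if not _RDKIT_AVAILABLE:
        raise ImportError(
            "rdkit is required for substance parameters; "
            "install it e.g. via 'pip install rdkit'."
        )
    return Chem.MolToSmiles(Chem.MolFromSmiles(smiles))
```

With this guard, `from baybe import Campaign` would succeed without rdkit, and only SubstanceParameter usage would raise a descriptive error.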

Validation of Campaign object for active_values in TaskParameter in case of string

I experienced unexpected behaviour in the validation when using a string instead of a list as the input type for active_values in the TaskParameter.

I created the TaskParameter with a single-character string for active_values, i.e. active_values="A", for testing purposes. The Campaign object could be created without a problem. Unexpectedly, I got an error when creating the Campaign object when the string for active_values was longer than one character, for example active_values="C12":

[screenshot of the validation error]

As @Scienfitz already pointed out, it is related to the input type, i.e. list vs. string: the string is interpreted as a list in the validation. Therefore "C" not in ["C12", ..] fails, whereas "A" not in ["A", ..] passes in the case of a single-character string. To provide robustness, or at least a clearer error message, the validation could check whether active_values is a list or a string and act accordingly.

When I input active_values as a list, i.e. active_values=["C12"], validation and creation of the Campaign object work without problems.
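The described behaviour boils down to strings being iterable character by character. A minimal reproduction (values and validate are illustrative, not baybe's actual validation code):

```python
values = ["A", "C12"]  # valid task values (illustrative)

def validate(active_values):
    # Iterating a string yields its characters, not the intended single value
    for v in active_values:
        if v not in values:
            raise ValueError(f"{v!r} is not among the valid values {values}")

validate(["C12"])  # passes: iterates over one element, "C12"
validate("A")      # passes only by accident: iterates to ["A"]
try:
    validate("C12")  # iterates to "C", "1", "2"
except ValueError as e:
    print(e)
```

An isinstance(active_values, str) check before iterating would catch the misuse with a clear message.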

ONNX Vulnerabilities

We currently ignore two ONNX vulnerabilities that appear in all versions <=1.15. Once version 1.16.0 is released (announced for March 18), we need to check these again.

/remind me to deploy on Mar 18

Printing a campaign object can be overly verbose when using chemical descriptors

Maybe this is intended, but here is an example of the campaign object from the README:

Campaign(searchspace=SearchSpace(discrete=SubspaceDiscrete(parameters=[CategoricalParameter(name='Granularity', _values=('coarse', 'medium', 'fine'), encoding=<CategoricalEncoding.OHE: 'OHE'>), NumericalDiscreteParameter(name='Pressure[bar]', encoding=None, _values=(1.0, 5.0, 10.0), tolerance=0.2), SubstanceParameter(name='Solvent', data={'Solvent A': 'COC', 'Solvent B': 'CCC', 'Solvent C': 'O', 'Solvent D': 'CS(=O)C'}, decorrelate=True, encoding=<SubstanceEncoding.MORDRED: 'MORDRED'>)], exp_rep=   Granularity  Pressure[bar]    Solvent
0       coarse            1.0  Solvent A
1       coarse            1.0  Solvent B
2       coarse            1.0  Solvent C
3       coarse            1.0  Solvent D
[... 32 further rows elided ...], metadata=    was_recommended  was_measured  dont_recommend
0             False         False           False
1             False         False           False
[... 34 further rows elided ...], empty_encoding=False, constraints=[], comp_rep=    Granularity_coarse  Granularity_medium  Granularity_fine  Pressure[bar]  ...  Solvent_MORDRED_ATSC2p
0                    1                   0                 0            1.0  ...                1.072052
1                    1                   0                 0            1.0  ...               -0.939884
[... 34 further rows and several MORDRED descriptor columns elided ...]), continuous=SubspaceContinuous(parameters=[], constraints_lin_eq=[], constraints_lin_ineq=[])), objective=Objective(mode='SINGLE', targets=[NumericalTarget(name='Yield', mode='MAX', bounds=Interval(lower=-inf, upper=inf), bounds_transform_func=None)], weights=[100.0], combine_func='GEOM_MEAN'), strategy=TwoPhaseStrategy(allow_repeated_recommendations=False, allow_recommending_already_measured=False, initial_recommender=FPSRecommender(), recommender=SequentialGreedyRecommender(surrogate_model=GaussianProcessSurrogate(model_params={}, _model=None), acquisition_function_cls='qEI', hybrid_sampler='None', sampling_percentage=1.0), switch_after=1), measurements_exp=  Granularity  Pressure[bar]    Solvent  Yield  BatchNr  FitNr
0      medium            1.0  Solvent D   79.8        1    NaN
1      coarse           10.0  Solvent C   54.1        1    NaN
2        fine            5.0  Solvent B   59.4        1    NaN
3      medium            1.0  Solvent D   79.8        2    NaN
4      coarse           10.0  Solvent C   54.1        2    NaN
5        fine            5.0  Solvent B   59.4        2    NaN, numerical_measurements_must_be_within_tolerance=True, n_batches_done=2, n_fits_done=0, _cached_recommendation=Empty DataFrame
Columns: []
Index: [])
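As a user-side workaround (not a baybe feature), the standard library's reprlib module can cap the length of such reprs; the Campaign class below is only a stand-in for the real, very verbose object:

```python
import reprlib

shortener = reprlib.Repr()
shortener.maxother = 80  # cap repr length for objects without a dedicated handler

class Campaign:  # stand-in for the real, very verbose object
    def __repr__(self):
        return "Campaign(" + "x" * 500 + ")"

print(shortener.repr(Campaign()))  # middle of the repr is replaced by '...'
```

A tidier long-term fix would be a dedicated __str__ on the campaign classes that summarizes the DataFrames (e.g. shape and column names) instead of printing them in full.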

Operator in ThresholdCondition gives ValueError

Thanks for sharing this work! BayBE is an amazing tool for BO and I sincerely enjoy going through the code. Amazing work, highly appreciated!

Here's one issue that I found:
When one includes operators such as ">", "<", "<=", or ">=", a ValueError is raised. This is probably due to the list of valid operators, which is set to ["=", "==", "!="] in the conditions.py file (_valid_tolerance_operators = ["=", "==", "!="]). Might be worth taking a look (or did I miss something?)

Cheers!
