emdgroup / baybe
Bayesian Optimization and Design of Experiments
Home Page: https://emdgroup.github.io/baybe/
License: Apache License 2.0
When trying to install baybe in an environment managed by Poetry, the installation fails with the error Package 'baybe[telemetry]' is listed as a dependency of itself.
Steps to reproduce on macOS:
poetry --version
> Poetry (version 1.8.2)
poetry init
# create a dummy package with Poetry, Python version ~3.11.0
cat pyproject.toml
> ...
> [tool.poetry.dependencies]
> python = "~3.11.0"
>
>
> [build-system]
> requires = ["poetry-core"]
> build-backend = "poetry.core.masonry.api"
> ...
poetry add baybe
> Using version ^0.8.1 for baybe
>
> Updating dependencies
> Resolving dependencies... (0.3s)
>
> Package 'baybe[telemetry]' is listed as a dependency of itself.
When de-serializing Campaign objects, we sometimes have to deal with inf values (for example, as bounds of the objective target). When these values are serialized to JSON, they are written as Infinity. Infinity is a valid JavaScript literal, but it is not part of the JSON specification (https://web.archive.org/web/20160414190115/http://tools.ietf.org/html/rfc4627).
Python's JSON library has no problem transforming back and forth between JSON and dict objects when Infinity is present. However, some other consumers of the JSON (for example, MongoDB) do not understand Infinity and cast the value to null. The same problem could bite any other downstream consumer that strictly adheres to the JSON specification.
Here are some resources discussing the issue and potential solutions:
https://medium.com/the-magic-pantry/infinity-and-json-cde6df62c17c
https://stackoverflow.com/questions/1423081/json-left-out-infinity-and-nan-json-status-in-ecmascript
This issue is intended to serve as a place where minor visual issues regarding the documentation can be collected. Once enough of them have been identified, they will be resolved in a corresponding PR.
Note in particular that this issue should not be used to collect major changes. Basically, if you see something in our documentation and think "Huh, that just looks a bit weird and would benefit from reformatting", then this is the place to put it 😄
The current user guide for campaigns needs an overhaul.
The docs at https://emdgroup.github.io/baybe/ seem to be built from the main branch instead of the 0.8.2 release. The API has changed on main, so the example on the front page no longer works with the version installed from PyPI.
Maybe this is intended, but as an example, here is the campaign object from the README:
Campaign(searchspace=SearchSpace(discrete=SubspaceDiscrete(parameters=[CategoricalParameter(name='Granularity', _values=('coarse', 'medium', 'fine'), encoding=<CategoricalEncoding.OHE: 'OHE'>), NumericalDiscreteParameter(name='Pressure[bar]', encoding=None, _values=(1.0, 5.0, 10.0), tolerance=0.2), SubstanceParameter(name='Solvent', data={'Solvent A': 'COC', 'Solvent B': 'CCC', 'Solvent C': 'O', 'Solvent D': 'CS(=O)C'}, decorrelate=True, encoding=<SubstanceEncoding.MORDRED: 'MORDRED'>)], exp_rep= Granularity Pressure[bar] Solvent
0 coarse 1.0 Solvent A
1 coarse 1.0 Solvent B
2 coarse 1.0 Solvent C
3 coarse 1.0 Solvent D
4 coarse 5.0 Solvent A
5 coarse 5.0 Solvent B
6 coarse 5.0 Solvent C
7 coarse 5.0 Solvent D
8 coarse 10.0 Solvent A
9 coarse 10.0 Solvent B
10 coarse 10.0 Solvent C
11 coarse 10.0 Solvent D
12 medium 1.0 Solvent A
13 medium 1.0 Solvent B
14 medium 1.0 Solvent C
15 medium 1.0 Solvent D
16 medium 5.0 Solvent A
17 medium 5.0 Solvent B
18 medium 5.0 Solvent C
19 medium 5.0 Solvent D
20 medium 10.0 Solvent A
21 medium 10.0 Solvent B
22 medium 10.0 Solvent C
23 medium 10.0 Solvent D
24 fine 1.0 Solvent A
25 fine 1.0 Solvent B
26 fine 1.0 Solvent C
27 fine 1.0 Solvent D
28 fine 5.0 Solvent A
29 fine 5.0 Solvent B
30 fine 5.0 Solvent C
31 fine 5.0 Solvent D
32 fine 10.0 Solvent A
33 fine 10.0 Solvent B
34 fine 10.0 Solvent C
35 fine 10.0 Solvent D, metadata= was_recommended was_measured dont_recommend
0 False False False
1 False False False
2 False False False
3 False False False
4 False False False
5 False False False
6 False False False
7 False False False
8 False False False
9 False False False
10 True True False
11 False False False
12 False False False
13 False False False
14 False False False
15 True True False
16 False False False
17 False False False
18 False False False
19 False False False
20 False False False
21 False False False
22 False False False
23 False False False
24 False False False
25 False False False
26 False False False
27 False False False
28 False False False
29 True True False
30 False False False
31 False False False
32 False False False
33 False False False
34 False False False
35 False False False, empty_encoding=False, constraints=[], comp_rep= Granularity_coarse Granularity_medium Granularity_fine Pressure[bar] \
0 1 0 0 1.0
1 1 0 0 1.0
2 1 0 0 1.0
3 1 0 0 1.0
4 1 0 0 5.0
5 1 0 0 5.0
6 1 0 0 5.0
7 1 0 0 5.0
8 1 0 0 10.0
9 1 0 0 10.0
10 1 0 0 10.0
11 1 0 0 10.0
12 0 1 0 1.0
13 0 1 0 1.0
14 0 1 0 1.0
15 0 1 0 1.0
16 0 1 0 5.0
17 0 1 0 5.0
18 0 1 0 5.0
19 0 1 0 5.0
20 0 1 0 10.0
21 0 1 0 10.0
22 0 1 0 10.0
23 0 1 0 10.0
24 0 0 1 1.0
25 0 0 1 1.0
26 0 0 1 1.0
27 0 0 1 1.0
28 0 0 1 5.0
29 0 0 1 5.0
30 0 0 1 5.0
31 0 0 1 5.0
32 0 0 1 10.0
33 0 0 1 10.0
34 0 0 1 10.0
35 0 0 1 10.0
Solvent_MORDRED_SpAbs_A Solvent_MORDRED_nHetero Solvent_MORDRED_ATS1dv \
0 2.828427 1.0 12.000000
1 2.828427 0.0 4.000000
2 0.000000 1.0 0.000000
3 3.464102 2.0 5.333333
4 2.828427 1.0 12.000000
5 2.828427 0.0 4.000000
6 0.000000 1.0 0.000000
7 3.464102 2.0 5.333333
8 2.828427 1.0 12.000000
9 2.828427 0.0 4.000000
10 0.000000 1.0 0.000000
11 3.464102 2.0 5.333333
12 2.828427 1.0 12.000000
13 2.828427 0.0 4.000000
14 0.000000 1.0 0.000000
15 3.464102 2.0 5.333333
16 2.828427 1.0 12.000000
17 2.828427 0.0 4.000000
18 0.000000 1.0 0.000000
19 3.464102 2.0 5.333333
20 2.828427 1.0 12.000000
21 2.828427 0.0 4.000000
22 0.000000 1.0 0.000000
23 3.464102 2.0 5.333333
24 2.828427 1.0 12.000000
25 2.828427 0.0 4.000000
26 0.000000 1.0 0.000000
27 3.464102 2.0 5.333333
28 2.828427 1.0 12.000000
29 2.828427 0.0 4.000000
30 0.000000 1.0 0.000000
31 3.464102 2.0 5.333333
32 2.828427 1.0 12.000000
33 2.828427 0.0 4.000000
34 0.000000 1.0 0.000000
35 3.464102 2.0 5.333333
Solvent_MORDRED_AATS0dv Solvent_MORDRED_ATSC2p
0 4.222222 1.072052
1 0.545455 -0.939884
2 5.333333 0.002031
3 3.844444 -3.587209
4 4.222222 1.072052
5 0.545455 -0.939884
6 5.333333 0.002031
7 3.844444 -3.587209
8 4.222222 1.072052
9 0.545455 -0.939884
10 5.333333 0.002031
11 3.844444 -3.587209
12 4.222222 1.072052
13 0.545455 -0.939884
14 5.333333 0.002031
15 3.844444 -3.587209
16 4.222222 1.072052
17 0.545455 -0.939884
18 5.333333 0.002031
19 3.844444 -3.587209
20 4.222222 1.072052
21 0.545455 -0.939884
22 5.333333 0.002031
23 3.844444 -3.587209
24 4.222222 1.072052
25 0.545455 -0.939884
26 5.333333 0.002031
27 3.844444 -3.587209
28 4.222222 1.072052
29 0.545455 -0.939884
30 5.333333 0.002031
31 3.844444 -3.587209
32 4.222222 1.072052
33 0.545455 -0.939884
34 5.333333 0.002031
35 3.844444 -3.587209 ), continuous=SubspaceContinuous(parameters=[], constraints_lin_eq=[], constraints_lin_ineq=[])), objective=Objective(mode='SINGLE', targets=[NumericalTarget(name='Yield', mode='MAX', bounds=Interval(lower=-inf, upper=inf), bounds_transform_func=None)], weights=[100.0], combine_func='GEOM_MEAN'), strategy=TwoPhaseStrategy(allow_repeated_recommendations=False, allow_recommending_already_measured=False, initial_recommender=FPSRecommender(), recommender=SequentialGreedyRecommender(surrogate_model=GaussianProcessSurrogate(model_params={}, _model=None), acquisition_function_cls='qEI', hybrid_sampler='None', sampling_percentage=1.0), switch_after=1), measurements_exp= Granularity Pressure[bar] Solvent Yield BatchNr FitNr
0 medium 1.0 Solvent D 79.8 1 NaN
1 coarse 10.0 Solvent C 54.1 1 NaN
2 fine 5.0 Solvent B 59.4 1 NaN
3 medium 1.0 Solvent D 79.8 2 NaN
4 coarse 10.0 Solvent C 54.1 2 NaN
5 fine 5.0 Solvent B 59.4 2 NaN, numerical_measurements_must_be_within_tolerance=True, n_batches_done=2, n_fits_done=0, _cached_recommendation=Empty DataFrame
Columns: []
Index: [])
This issue is created to make everybody (specifically @Scienfitz and @AdrianSosic) aware that I am starting to re-write the user guide for campaigns.
The corresponding discussion will be closed; any comments that I should be aware of before opening the PR can be posted here.
Thanks for sharing this work! BayBE is an amazing tool for BO and I sincerely enjoy going through the code. Amazing work, highly appreciated!
Here's one issue that I found:
When one includes operators such as ">", "<", "<=", or ">=", a ValueError is raised. This is probably due to the list of valid operators in conditions.py, which is set to _valid_tolerance_operators = ["=", "==", "!="]. Might be worth taking a look into it (or did I miss something?)
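A minimal sketch of the validation path as I understand it (the operator list is quoted from conditions.py above; the validator function here is hypothetical, not BayBE's actual code):

_valid_tolerance_operators = ["=", "==", "!="]

def _validate_operator(operator: str) -> None:
    # Operators are checked against this whitelist only, so any comparison
    # operator outside it is rejected.
    if operator not in _valid_tolerance_operators:
        raise ValueError(f"Operator {operator!r} is not a valid tolerance operator.")

_validate_operator(">")  # raises ValueError, matching the behavior reported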
Cheers!
The user guide for transfer learning needs to be written.
Steps to reproduce:
pip install baybe
python -c "from baybe.objectives import SingleTargetObjective"
(from README)
I do see the module in the repo, but it appears it is not packaged in the version on PyPI?
The test that ensures an invalid config is rejected fails in Python 3.12 (but not in other versions); this seems related to cattrs.
Also, it is possible to write a config with recomEnDer (spelling mistake), and it will silently fall back to a default instead of throwing an error.
PR for the Python 3.12 upgrade: #153
The user guide for targets needs an overhaul.
Hello! I get an error when trying to run simulate_experiment. The code errors here, in simulation.py:
searchspace = campaign.searchspace.discrete.exp_rep
missing_inds = searchspace.index[
    searchspace.merge(lookup, how="left", indicator=True)["_merge"]
    == "left_only"
]
campaign.searchspace.discrete.metadata.loc[
    missing_inds, "dont_recommend"
] = True
The error is IndexError: boolean index did not match indexed array along dimension 0; dimension is 131220 but corresponding boolean dimension is 131227
I'm a bit confused, since the error happens because the lengths of missing_inds and campaign.searchspace.discrete.metadata don't match up, yet they are both derived from campaign.searchspace.discrete?
What might be going on here?
Hi team,
I'm running baybe in a container under a non-root user, with BAYBE_TELEMETRY_ENABLED=false set. However, some telemetry code still seems to be executed:
File "/workdir/.venv/lib/python3.12/site-packages/baybe/telemetry.py", line 109, in <module>
--
hashlib.sha256(getpass.getuser().upper().encode()).hexdigest().upper()[:10]
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/getpass.py", line 169, in getuser
return pwd.getpwuid(os.getuid())[0]
^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'getpwuid(): uid not found: 1000'
As a workaround, I will create a user in the container so that this call doesn't throw an exception. However, I would expect that no telemetry code is executed as part of loading the baybe module when telemetry is disabled.
In general, the telemetry code seems pretty heavy: it makes suspicious syscalls to retrieve the hostname and uid, and it tries to make an HTTP request, which will likely raise flags when scanning code bases for malicious behavior or backdoors.
I also wanted to mention that the claim "this hash is irreversible and cannot identify the user or their machine" is not necessarily true: you can easily pre-compute a rainbow table of the hashes of all valid usernames and then do a reverse lookup.
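A sketch of that reverse lookup, using the exact hashing recipe from the traceback above (the candidate username list is of course hypothetical):

import hashlib

# Usernames come from a small, guessable space, so enumerating them is cheap.
candidates = ["alice", "bob", "jsmith"]  # hypothetical list of valid usernames
table = {
    hashlib.sha256(name.upper().encode()).hexdigest().upper()[:10]: name
    for name in candidates
}

# Given an observed telemetry hash, a single dict lookup recovers the user.
observed = hashlib.sha256("BOB".encode()).hexdigest().upper()[:10]
print(table.get(observed))  # -> "bob"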
The user guide for surrogates needs an overhaul.
I had a nice time reviewing this repository! Overall I think it's a really comprehensive, clean, and well-documented project. Thank you for open-sourcing it! Find below some questions and suggestions:
A Colab notebook or similar would be really good, I think. See e.g. https://colab.research.google.com/drive/1VEHXBLVkn5NZ7N-Oj6-dc_hkIfwFcUE-?usp=sharing. I needed to %pip install 'baybe[chem,simulation]' numpy==1.24.4 on Colab; otherwise it seems to work OK (see numpy/numpy#25150 (comment)). Consider moving "Quick Start" into a tutorial notebook and providing a Colab link. It looks like you already have the setup to convert Jupyter notebooks into HTML pages (e.g., https://emdgroup.github.io/baybe/examples/Constraints_Discrete/mixture_constraints.html).
It appears that the README example does only a single iteration. I would have expected to see an optimization loop (see the sketch below) and some information about the best parameters, though I get that this is geared more towards wetlab scientists.
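A hedged sketch of the loop I had in mind, using only method names that appear elsewhere in this repo's issues (campaign.recommend and campaign.add_measurements; note the recommend keyword has changed names across versions, batch_size vs. batch_quantity, and run_experiments is a hypothetical stand-in for the lab/lookup step):

for _ in range(10):  # number of DOE iterations
    recommendation = campaign.recommend(batch_size=3)
    recommendation["Yield"] = run_experiments(recommendation)  # hypothetical lookup
    campaign.add_measurements(recommendation)
# ...followed by reporting the best measured parameters so far.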
Maybe clarify in the README that people can choose from different scaling methods with a link to the docs? I eventually happened upon that part of the docs.
I think the detailed installation information could go into an "Advanced Installation" section (either a separate README that gets incorporated into the docs, or near the end of the main README). The main README would then have a "quick installation" section with a link to the advanced installation instructions. See #95
The README example doesn't have much by way of outputs (e.g., print statements and expected output). See #95
Same for visual representations, such as an optimization trace using BayBE on a task. Are there any built-in visualization methods? If not, consider including at least some examples of visualizing performance.
It would be nice to have an "Edit on GitHub" link on your documentation pages -- it makes it a lot easier for others to contribute I think. See #94
It would be nice if the user guide linked to a corresponding tutorial or section of tutorials. For example, linking https://emdgroup.github.io/baybe/userguide/strategy.html to https://emdgroup.github.io/baybe/examples/Basics/strategies.html#
At a glance, this was difficult to parse: "Similar to the SequentialStrategy, the StreamingSequentialStrategy enables the utilization of arbitrary iterables to select recommender. Note that this strategy is however not serializable." (https://emdgroup.github.io/baybe/userguide/strategy.html#the-streamingsequentialstrategy). I think I kind of get it, but not necessarily when or how I would want to use it.
I think there is too much granularity on some of your docs pages, e.g. https://emdgroup.github.io/baybe/examples/Constraints_Discrete/Constraints_Discrete.html (i.e., lots of repetition, and not a whole lot of valuable information gained from the bottom-most headings). No worries if this would be difficult to change.
It would be nice to get some more details about each of the "Examples" sections rather than needing to click into each one to better understand what it's about. I.e., https://emdgroup.github.io/baybe/examples/examples.html could have some text at the top.
As I'm going into more of the tutorials, I'm seeing that it's really comprehensive. For example, a demonstration of adding existing data https://emdgroup.github.io/baybe/examples/Backtesting/full_initial_data.html. I think there needs to be a better way to highlight/organize/point people to the tutorials they care about most. Happy to discuss more.
I don't think "Backtesting" is common terminology for chem/materials informatics communities, at least in North America. It seems to be more common in finance, for example: https://en.wikipedia.org/wiki/Backtesting. When I wandered into https://emdgroup.github.io/baybe/_autosummary/baybe.simulation.html#module-baybe.simulation, I finally realized that what you refer to as simulation and backtesting is what I would typically refer to as benchmarking. I was thinking that maybe you implemented multi-task BO, where you could leverage physics-based simulations to help inform wetlab/experimental search campaigns. It took a while before this became clear to me.
Right now, "Transfer learning: Mix data from multiple campaigns and accelerate optimization" is mentioned on https://emdgroup.github.io/baybe/misc/readme_link.html#, but it doesn't seem like this is really implemented yet, other than https://emdgroup.github.io/baybe/_autosummary/baybe.simulation.simulate_transfer_learning.html#baybe.simulation.simulate_transfer_learning. However, it doesn't appear to me that transfer learning is being used here. Even going through the function (https://emdgroup.github.io/baybe/_modules/baybe/simulation.html#simulate_transfer_learning), it was a bit tough to realize what was happening until I looked up TaskParameter
. Suddenly, it made sense to me that what you're referring to as a task parameter is what I refer to as a contextual variable. This is also really good for me to see that contextual variable optimization is supported. However, I don't really consider this as transfer learning. In my mind, transfer learning means using one model to inform another. In contextual Bayesian optimization, certain variables are being fixed at each prediction. Perhaps I misunderstood something though. I imagine this will become clearer once https://emdgroup.github.io/baybe/userguide/transfer_learning.html has been developed.
It seems that Expected Hypervolume Improvement (EHVI) isn't one of the supported options for multi-objective optimization. Could you comment on this? With the DESIRABILITY mode, is each of the targets modeled independently prior to scalarization? If not, I tend to have a hard time referring to something like this as multi-objective optimization; in my mind, it's single-objective optimization of a fixed scalarization of several objectives. As alluded to in https://emdgroup.github.io/baybe/userguide/objective.html#desirability, it's good that a clarification is made about the scales being combined.
Do you perform conditioning on your batches (i.e., compute a joint acquisition function value)? For example, using fantasy point modeling. This is one of the easiest "gotchas" of batch optimization. See facebook/Ax#778 (comment) and https://youtu.be/JzgkSR6FFyM?si=dzv3RVvjKrZlkjlH
What needs does BayBE fulfill that other packages don't? I think the README should clarify what makes BayBE stand apart from others and reference these other packages, too. For example, there's Ax (https://ax.dev), Gauche (https://github.com/leojklarner/gauche), Atlas (https://github.com/aspuru-guzik-group/atlas), Olympus, and https://github.com/experimental-design/bofire.
For example, I keep what is probably an overly inclusive list of GitHub repos at https://github.com/stars/sgbaird/lists/optimization-and-tuning and a shortlist at https://github.com/AccelerationConsortium/awesome-self-driving-labs/blob/main/readme.md#optimization. I added BayBE to these lists recently.
I'm also interested to see an optimization comparison/benchmark of using the Mordred encoding with the solvent vs. treating it as a purely categorical variable.
I can appreciate that BayBE seems well-maintained from a software developer perspective! This is welcome in the fields of chemistry and materials science, which understandably often lack this.
There are a lot of dependencies. I'm glad you split them up into groups!
I notice you have a lot of >= dependencies in https://github.com/emdgroup/baybe/blob/main/pyproject.toml. Is this overly restrictive? It's OK if you don't think so.
The docstrings look really nice, and it's nice to have the function cross-linking across the API docs.
I look forward to seeing how you use hypothesis testing here!
Feel free to convert to a discussion if desired, and happy to refactor into multiple items if that would be better.
Hi! I was wondering if it would be possible to add a feature where simulations still return the results compiled up to the point of an error?
The situation I'm running into when running on larger datasets is a botorch error to the tune of
"All attempts to fit the model have failed."
I am in the process of troubleshooting what about the dataset is causing the failure, but in the meantime it would be nice to see the results up to that point, which should include dozens of batches of experiments (see the sketch below).
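A sketch of the behavior requested, assuming a per-batch loop inside the simulation (everything here is a hypothetical stand-in except the quoted botorch error message):

def run_one_batch(i):
    # Hypothetical per-batch simulation step; pretend fitting fails at batch 12.
    if i == 12:
        raise RuntimeError("All attempts to fit the model have failed.")
    return {"batch": i}

results = []
try:
    for i in range(30):
        results.append(run_one_batch(i))
except RuntimeError:
    pass  # keep the 12 batches collected so far instead of losing them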
Also, if you have any experience with what might be causing an error like this, that would be helpful!
Referring to this comment in a botorch thread: pytorch/botorch#1226 (comment)
I initially wondered if this could be my issue, but baybe should prevent this from being an issue since it identifies duplicate parameter values and randomly picks one.
In baybe/examples/Basics/campaign.py, line 80 reads recommendation = campaign.recommend(batch_size=2).
The keyword argument to campaign.recommend has been changed to batch_quantity.
Please update the example.
This is also a problem in baybe/examples/Serialization/basic_serialization.py on lines 97 and 98.
We currently ignore two ONNX vulnerabilities that appear in all versions <=1.15. Once version 1.16.0 is released (announced for March 18), we need to check these again.
/remind me to deploy on Mar 18
The user guide for search spaces needs an overhaul.
I experienced unexpected behaviour in the validation when using a string instead of a list as the input type for active_values in the TaskParameter.
I created the TaskParameter with a single-char string for active_values, i.e. active_values="A", for testing purposes, and the Campaign object could be created without a problem. Unexpectedly, I got an error when creating the Campaign object in case the string for active_values was longer than one character, for example active_values="C12".
As @Scienfitz already pointed out, it is related to the input type, i.e. list vs. string: the string is interpreted as a list of characters in the validation. Therefore "C" is not in ["C12", ..] fails, whereas "A" is not in ["A", ..] works in the case of a single-char string. To provide robustness, or at least a clearer error message, the validation could check whether active_values is a list or a string and act accordingly.
When I input active_values as a list, i.e. active_values=["C12"], the validation and creation of the Campaign object work without problems.
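A minimal sketch of why the string slips through (this is generic Python behavior; the validator shown here is hypothetical, not BayBE's actual code):

values = ("A", "B", "C12")

def validate_active_values(active_values, values):
    for val in active_values:  # iterating a string yields single characters
        if val not in values:
            raise ValueError(f"{val!r} is not among the parameter values")

validate_active_values(["C12"], values)  # passes
validate_active_values("A", values)      # passes only by accident
validate_active_values("C12", values)    # raises: 'C' is not among the parameter values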
I am trying to run BayBE 0.7.2 with Python 3.11 on a Windows machine, using the example script from basic serialization (examples/Serialization/basic_serialization.py).
When trying to import Campaign, I get the following:
Cell In[2], line 1
    from baybe import Campaign
File ~\code\auto_flow\.venv\Lib\site-packages\baybe\__init__.py:5
    from baybe.campaign import Campaign
File ~\code\auto_flow\.venv\Lib\site-packages\baybe\campaign.py:13
    from baybe.parameters.base import Parameter
File ~\code\auto_flow\.venv\Lib\site-packages\baybe\parameters\__init__.py:14
    from baybe.parameters.substance import SubstanceParameter
File ~\code\auto_flow\.venv\Lib\site-packages\baybe\parameters\substance.py:13
    from baybe.utils import (
ImportError: cannot import name 'get_canonical_smiles' from 'baybe.utils' (C:\Users\M316675\code\auto_flow\.venv\Lib\site-packages\baybe\utils\__init__.py)
I had a look, and the problem seems to be that substance.py unconditionally imports get_canonical_smiles, while in the utils/chemistry.py module get_canonical_smiles is only defined if rdkit is installed.
I don't have rdkit installed, so the import of 'get_canonical_smiles' in substance.py fails.
I would recommend including rdkit in the dependencies for this project.
The user guide for the parameters might need an overhaul but needs to be checked more thoroughly.
Is there a way to calculate an upper bound for the size of the search space? I'm trying to find a way to give the user feedback on the complexity of the search space, and also on how much memory needs to be allocated to proceed with the campaign and generate experiments.
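One back-of-the-envelope upper bound for the discrete part, as a sketch (constraints can only shrink the Cartesian product, so the product of per-parameter value counts bounds the row count; the numbers below are taken from the README example quoted earlier in this page):

import math

# 3 Granularity values x 3 Pressure values x 4 Solvents, as in the README dump
n_values_per_parameter = [3, 3, 4]
max_rows = math.prod(n_values_per_parameter)
print(max_rows)  # 36, matching the 36-row exp_rep shown above

Memory then scales roughly with that row count times the number of columns in the computational representation.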
Hi,
When running a simulation in match mode, the code fails when trying to generate results because it attempts to take np.mean() of an Interval, returning TypeError: unsupported operand type(s) for: 'Interval' and 'int'
The error occurs on line 551 of simulation.py, here:

elif target.mode is TargetMode.MATCH:
    match_val = np.mean(target.bounds)

In the meantime, I am just changing the line locally to

match_val = np.mean([target.bounds.lower, target.bounds.upper])

I assume this is just outdated code, since you mentioned this was an old module in need of a refresh!
The user guide for recommenders needs an overhaul.
The user guide for strategies needs an overhaul.
The user guide for objectives needs an overhaul.
Thank you for the wonderful tool! I have a question about the architecture. Is there a way to add already-observed experimental data to a campaign? Also, is there a way to inspect, as a dataframe, the predicted mean and variance for the recommended experimental points and the exploration space, as well as the evaluation values of the acquisition function?
The user guide for the constraints might need an overhaul but needs to be checked more thoroughly.
I'm wondering if the recommendation times I'm encountering are expected given my setup:
Machine:
MacBook Air, 15 inch, M2, 2023
Memory: 16 GB
OS: Sonoma 14.4.1
Python: 3.11.8
Model:
Single NumericalTarget
Parameters: 4 SubstanceParameters (~140 total SMILES molecules), 4 NumericalContinuousParameters
Constraints: 4 numerical parameters must sum to 1.0
Recommender: TwoPhaseMetaRecommender(
initial_recommender=RandomRecommender(),
recommender=SequentialGreedyRecommender())
So when I add 1000 datapoints via campaign.add_measurements(), it takes ~4 days to make a recommendation with a batch size of 3. I started a test with only 10 datapoints, and it has been running since last night.
Does this sound expected given my machine, model, and data? If so, what would be the recommended ways to improve the speed? For the molecules, I've tried with and without Mordred encoding & decorrelation; it doesn't seem to make a big difference.
If this doesn't sound expected, how would you recommend I troubleshoot what could be causing the issue?
Thanks in advance!
Hi!
I was wondering if there is anything hidden inside baybe that could be locking a random seed somewhere inadvertently?
I am seeing that in a sequence where I call the baybe simulate_experiment method in a loop, I sometimes get the exact same results on multiple iterations, and this carries over to a different method (defined by me) that also relies on calls to np.random. I'm noticing that when I call simulate_experiment in the same loop as my method, my method also returns identical results, but when I comment out simulate_experiment, my method goes back to returning random results.
I also tried manually setting the random_seed input to the iteration integer on every pass of the loop, but this still happens.
I can work on putting together a repro of this, but I just wanted to go ahead and put this out there to see if there's anywhere obvious this might be happening.
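For reference, a minimal sketch of the failure mode being described: if any library call reseeds NumPy's global RNG, unrelated np.random calls afterwards become deterministic (seeded_library_call is a hypothetical stand-in, not baybe code):

import numpy as np

def seeded_library_call(random_seed=1337):
    np.random.seed(random_seed)  # hypothetical internal seeding of the global RNG

def my_method():
    return np.random.rand()

for _ in range(3):
    seeded_library_call()
    print(my_method())  # prints the same value on every iteration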
Thanks for building such a cool abstraction for designing experiments using powerful Bayesian methods.
I have a set of experimental data, each entry containing various parameters. I am not sure whether it is by design, but in practice it is natural for a parameter to hold the same value across multiple experiments (duplicates). However, when defining the parameters as CategoricalParameter, NumericalDiscreteParameter, or NumericalContinuousParameter, I get a traceback; here is an example for one of the parameters:
ValueError: Cannot assign the following values containing duplicates to parameter FeatureName: (1, 2, 3, 4, 5, 6, 1, 2, 3, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6).
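If I understand the error correctly, a parameter expects the distinct values it can take rather than the raw observed column, so deduplicating first avoids it. A sketch (the parameter name and data come from the traceback above; whether this matches the intended usage is an assumption):

from baybe.parameters import NumericalDiscreteParameter

observed = [1, 2, 3, 4, 5, 6, 1, 2, 3, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]
unique_values = sorted(set(observed))  # [1, 2, 3, 4, 5, 6]

param = NumericalDiscreteParameter(name="FeatureName", values=unique_values)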
Hi, is it/would it be possible to expose the underlying model of a campaign in order to calculate predicted means and variances for a set of new measurements (not necessarily those recommended by the campaign, but any user-specified measurement that exists within the search space)?
Ideally I'd like to be able to quantify the performance of the model on a set of known measurements as well.
Thanks!
Hi,
I am wondering what the best way is to encode a feature that is a variable-length vector of integers. Would this work out of the box?
If not, the way I would naturally do this is to find the datapoint with the longest vector for this feature, let's call its length N. Then I would define N features, where each element in the vector is a different feature. For datapoints with len(vector) < N, all features between len(vector) and N would be assigned a value of 0.
This seems not ideal though, since the number and identity of the features depend on the data, which will change over time.
Is there a better way to do this, or is specific accommodation of this on the roadmap anywhere? Thanks!!
The user guide for simulations needs to be written.