
Comments (7)

Scienfitz commented on May 24, 2024

Hi @mmortazavi ,
summarizing in a slightly more concise way:

  • Your original issue was that you tried to create a parameter from a set of non-unique values, like NumericalDiscreteParameter(name='bla', values=[1, 2, 3, 1, 2, 3]). We don't allow this to avoid possible downstream problems. Instead, just provide unique values, like NumericalDiscreteParameter(name='bla', values=[1, 2, 3]). If you get these values from a data frame, just deduplicate them first, e.g. vals = np.unique([1, 2, 3, 1, 2, 3, ...]).

  • The search space you've created is excessively large. This happens if the combination of the number of parameters and their numbers of possible values is too large. This is a technical memory limitation and can only be circumvented by reconsidering how you design your campaign; see below.

  • This memory problem can only happen in discrete spaces. It looks to me like many of your variables, such as FF, FWHM, 2-theta, Intensity, and relative_intensity, would make better continuous variables (in fact, the names make me suspect that these are not parameters but rather measurements). For that we have NumericalContinuousParameter.
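The deduplication step from the first bullet can be sketched with plain NumPy (the parameter construction itself is only shown as a comment, since it depends on your BayBE setup):

```python
import numpy as np

# Values as they might come out of a raw data frame column, with duplicates
raw_values = [1, 2, 3, 1, 2, 3]

# np.unique returns the sorted unique values, i.e. a deduplicated set
vals = np.unique(raw_values)
print(vals.tolist())  # [1, 2, 3]

# The deduplicated values can then be passed on, e.g.:
# NumericalDiscreteParameter(name="bla", values=tuple(vals))
```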

from baybe.

AdrianSosic commented on May 24, 2024

Hi @mmortazavi, glad to hear that this makes sense to you!

Regarding your new error: strictly speaking, this is not directly related to BayBE but to the way you attempt to represent and create your search space. You see, if you try to build the full Cartesian product of 25 parameters, the corresponding array/DataFrame holding the resulting set of configurations becomes exponentially large, and you will run out of memory very quickly. Unfortunately, I can't see the exact sizes of each parameter from the code you shared (i.e., how many possible values each parameter can take), but even for the smallest numbers you quickly get astronomically large sizes. For instance, if each parameter can take only 2 possible values, the resulting Cartesian product already consists of 2^25 = 33,554,432 elements (still manageable); for three, it's already 3^25 = 847,288,609,443 elements, ... you see where this goes.
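The growth above can be reproduced with a few lines of plain Python (nothing BayBE-specific here):

```python
# Size of the Cartesian product of 25 parameters, each with k possible values
n_params = 25

for k in (2, 3, 4):
    size = k ** n_params
    print(f"{k} values per parameter -> {size:,} configurations")
```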

So how to solve this?
The answer depends a bit on what you are trying to achieve. Is it really the full product space that you want to optimize over?

  • If "yes", then an enumeration-based approach is simply not feasible due to the combinatorial explosion of configurations. Instead, what you could do is go via a continuous relaxation of the problem. That is, represent your parameters as continuous ones, which will trigger a gradient-based optimization in the backend. You can then map the found real-valued configuration back to the closest possible discrete configuration. Currently, this additional step needs to be carried out manually, but automatic routines for this are on our roadmap. Nonetheless, this would be a possible alternative even with the current capabilities of BayBE.

  • If "no" and you can restrict your search to reasonably small subsets of the Cartesian product, then you can keep your existing discrete-parameter approach and simply swap out the search space creation step.

    Instead of

    searchspace = SearchSpace.from_product(parameters)

    write

    from baybe.searchspace import SubspaceDiscrete
    searchspace = SearchSpace(discrete=SubspaceDiscrete.from_dataframe(configurations, parameters))

    where configurations is a DataFrame containing exactly the subset of configurations that you would like to consider for your optimization.
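The manual mapping step mentioned in the "yes" branch above (snapping a real-valued recommendation back to the closest discrete configuration) could be sketched coordinate-wise like this; the parameter names and values here are made up purely for illustration:

```python
import numpy as np

# Hypothetical allowed discrete values per parameter
allowed = {
    "temperature": np.array([300.0, 350.0, 400.0]),
    "pressure": np.array([1.0, 2.0, 5.0]),
}

# A real-valued configuration as it might come out of a continuous optimization
recommendation = {"temperature": 362.7, "pressure": 1.8}

# Snap each coordinate independently to the closest allowed value
snapped = {
    name: float(vals[np.argmin(np.abs(vals - recommendation[name]))])
    for name, vals in allowed.items()
}
print(snapped)  # {'temperature': 350.0, 'pressure': 2.0}
```

Note that coordinate-wise snapping is only a heuristic: if the discrete configurations are constrained (not a full grid), the nearest feasible configuration may differ.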

Let me know if this helps 👍🏼


AdrianSosic commented on May 24, 2024

Hi @mmortazavi, thank you very much! ❤️‍🔥 Great to see that people are already interacting with the framework 🏅 Getting user input at this stage will be really helpful for us to shape our APIs and also to adapt to the various use cases that are out there.

To your question: The fact that a parameter may not contain duplicate values is by design. Probably, it's up to us to better explain why this is the case (we are currently working on the docs!). But to give you the idea: The purpose of the parameter objects is to define what (physical) configurations are possible in the first place. While you can repeat an experiment for a given value, say 1, this would still refer to the same underlying setting in the parameter space. Duplicate experiments, on the other hand, are perfectly valid: they enter as separate entries in your measurements dataframe that all refer to the same underlying parameter setting.

For example, this is a perfectly valid setting:

import pandas as pd

from baybe import Campaign
from baybe.objective import Objective
from baybe.parameters import NumericalDiscreteParameter
from baybe.searchspace import SearchSpace
from baybe.targets import NumericalTarget

parameters = NumericalDiscreteParameter(name="param", values=[1, 2, 3])
targets = NumericalTarget(name="target", mode="MAX")
objective = Objective(targets=[targets], mode="SINGLE")
searchspace = SearchSpace.from_product([parameters])
campaign = Campaign(searchspace=searchspace, objective=objective)

# You can provide as many duplicates as you wish. Each record will be treated as a
# single data point that enters the model. Duplicate parameter settings simply provide
# more evidence about the target values for those configurations, helping to reduce
# model uncertainty.
data_containing_duplicates = pd.DataFrame.from_records(
    [
        {"param": 1, "target": 0},
        {"param": 1, "target": 0},  # exact duplicate
        {"param": 1, "target": 1},  # duplicate parameter setting, different measurement
        {"param": 2, "target": 5},
    ]
)
campaign.add_measurements(data_containing_duplicates)

But perhaps I misunderstood the question?


mmortazavi commented on May 24, 2024

@AdrianSosic Pleasure. I will do my best to continue using the tool and provide feedback. I can see that there is room for improvement: documentation, real-data examples from various industry domains, features, etc.

Many thanks for the detailed answer. I now understand the parameter objects: restricting them to possible (physical) configurations is totally reasonable, meaning a unique set of configurations needs to be defined in the parameter space. I have dealt with that in my real physical data. However, now I am facing another challenge. Not sure this thread is the right place to discuss the follow-up problem, though!

I have 25 parameters and, in total, 27 experiments (i.e. rows). When I want to create the search space (Cartesian product of all possibilities):

searchspace = SearchSpace.from_product(parameters)
campaign = Campaign(searchspace=searchspace, objective=objective)

I get the following traceback:

     54     return _wrapit(obj, method, *args, **kwds)
     56 try:
---> 57     return bound(*args, **kwds)
     58 except TypeError:
     59     # A TypeError occurs if the object does have such a method in its
     60     # class, but its signature is not identical to that of NumPy's. This
   (...)
     64     # Call _wrapit from within the except clause to ensure a potential
     65     # exception has a traceback chain.
     66     return _wrapit(obj, method, *args, **kwds)

MemoryError: Unable to allocate 10.8 PiB for an array with shape (12150000000000000,) and data type int8
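For reference, the reported allocation size is consistent with the array shape from the traceback (one byte per int8 element):

```python
n_elements = 12_150_000_000_000_000  # array shape reported in the MemoryError
bytes_needed = n_elements * 1        # int8 -> one byte per element
pib = bytes_needed / 2**50           # 1 PiB = 2**50 bytes
print(f"{pib:.1f} PiB")  # 10.8 PiB
```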

Somehow I feel the search space has become large, but in fact it should not be! I have tried other ways to create the SearchSpace, including other methods from the docs, but so far my attempts have failed!

If you suggest I post this in a separate issue, let me know; I can do that and elaborate a bit more.

full code:

import pandas as pd
from baybe import Campaign
from baybe.objective import Objective
from baybe.parameters import NumericalDiscreteParameter, CategoricalParameter
from baybe.searchspace import SearchSpace, SubspaceDiscrete, SubspaceContinuous
from baybe.targets import NumericalTarget

df.head(2)

DROP        FF  FWHM 1  FWHM 2  FWHM 3  FWHM 4  FWHM 5   FWHM 6  2-theta 1  2-theta 2  2-theta 3  2-theta 4  2-theta 5  2-theta 6  Intensity 1  Intensity 2  Intensity 3  Intensity 4  Intensity 5  Intensity 6  relative_intensity 1  relative_intensity 2  relative_intensity 3  relative_intensity 4  relative_intensity 5  relative_intensity 6
    1 38.487449    0.45    0.35    0.51    0.57    0.54 0.925902      32.25       34.9      36.86      48.29      57.39      63.89          0.5          0.4         0.78         0.15          0.3         0.18                  64.1                 51.28                 100.0                 19.23                 38.46                 23.07
    2 41.509692    0.45    0.35    0.51    0.57    0.54 0.925902      32.25       34.9      36.86      48.29      57.39      63.89          0.5          0.4         0.78         0.15          0.3         0.18                  64.1                 51.28                 100.0                 19.23                 38.46                 23.07

numerical_df = df.drop(['FF', "DROP"], axis=1)
categorical_df = df[["DROP"]]
target_df = df[["FF"]]

target = NumericalTarget(
    name="FF",
    mode="MAX",
)
objective = Objective(mode="SINGLE", targets=[target])

parameters = []

for numerical_col in numerical_df.columns:
    parameters.append(NumericalDiscreteParameter(
                                                name=numerical_col,
                                                values=tuple(set(df[numerical_col])),
                                                ))
for categorical_col in categorical_df.columns:
    parameters.append(CategoricalParameter(
                                        name=categorical_col,
                                        values=tuple(set(df[categorical_col])),
                                        encoding="INT",
                                        ))
    
searchspace = SearchSpace.from_product(parameters)
campaign = Campaign(searchspace=searchspace, objective=objective)
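One way to anticipate this error is to estimate the product-space size before calling SearchSpace.from_product; this plain-Python sketch uses made-up toy columns standing in for df:

```python
import math

# Toy stand-in for the real data frame: column name -> column values
toy_df = {
    "FWHM 1": [0.45, 0.35, 0.51, 0.45],
    "2-theta 1": [32.25, 34.9, 32.25],
    "DROP": [1, 2, 1, 2],
}

# Number of unique values per parameter column
cardinalities = {col: len(set(vals)) for col, vals in toy_df.items()}

# Size of the full Cartesian product that from_product would enumerate
product_size = math.prod(cardinalities.values())
print(cardinalities)  # {'FWHM 1': 3, '2-theta 1': 2, 'DROP': 2}
print(product_size)   # 12
```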


mmortazavi commented on May 24, 2024

Thanks @Scienfitz and @AdrianSosic for your hints and recommendations.
As explained above, the problem of creating parameters from a set of non-unique values is understood and dealt with.
The memory problem as a result of SearchSpace.from_product(parameters) is also understandable. @AdrianSosic, I believe your line of thinking fits my problem better.

Essentially, the answer to "Is it really the full product space that you want to optimize over?" is "no".

Since we are discussing back and forth about the suitable methods available in baybe, let me describe the problem at hand better. As @Scienfitz guessed correctly, I have a limited number of experimental measurements. The sample data I posted contains features generated from XRD of experimental samples, plus a substance (DROP), and the target is basically the Fill Factor (FF) (i.e. a measure of the "squareness" of the solar cell, the area of the largest rectangle that fits in the IV curve).

The ultimate goal is: given the set of conducted experimental measurements, how (i.e. based on what next experiment) can FF be maximized? I am trying to use Bayesian methods to explore the search space and, with a physics-driven exploration-exploitation trade-off, get the next (or a set of) recommended experiments. In this case, the XRD-generated features would provide guidelines for what growth mechanism to use in the lab for the next batch!

I am now wondering whether XRD-driven features should be considered a discrete search space (I am leaning towards this, since we have more control over the next measurement) or rather a continuous space!

Currently, based on earlier suggestions, I am using the below code:

from baybe import Campaign
from baybe.objective import Objective
from baybe.parameters import NumericalDiscreteParameter, CategoricalParameter, NumericalContinuousParameter
from baybe.searchspace import SearchSpace, SubspaceDiscrete, SubspaceContinuous
from baybe.targets import NumericalTarget

numerical_df = df.drop(['FF', "DROP"], axis=1)
categorical_df = df[["DROP"]]
target_df = df[["FF"]]

target = NumericalTarget(
    name="FF",
    mode="MAX",
)
objective = Objective(mode="SINGLE", targets=[target])

parameters = []

for numerical_col in numerical_df.columns:
    parameters.append(NumericalDiscreteParameter(
                                                name=numerical_col,
                                                values=tuple(set(df[numerical_col])),
                                                ))
for categorical_col in categorical_df.columns:
    parameters.append(CategoricalParameter(
                                        name=categorical_col,
                                        values=tuple(set(df[categorical_col])),
                                        encoding="INT",
                                        ))
    
searchspace = SearchSpace(discrete=SubspaceDiscrete.from_dataframe(df, parameters))
campaign = Campaign(searchspace=searchspace, objective=objective)

Whether or not the approach is valid given the last explanation of the problem at hand, I am still puzzled how to optimize and search for the next points from the created campaign object. Simply using campaign.recommend(batch_quantity=3) doesn't seem correct, does it? At least from here, I get 3 rows of data, exactly the same as in my data.

And here, in SearchSpace(discrete=SubspaceDiscrete.from_dataframe(df, parameters)), the df I pass is the whole dataframe containing my target. I saw in the docs that the target can or shall be added later, like in one of the demos:

df["Target"] = [values...]
campaign.add_measurements(df)

Perhaps you can guide me to a full example here, if one is already in the documentation. I have already used the Python bayes_opt package to achieve the same thing: there I used an ML model as a surrogate to estimate the unknown objective function for the underlying data, and Bayesian optimization to identify the next point of interest. However, there I have less flexibility in defining my search space (discrete, for instance, is not straightforward). I was thinking of going with BoTorch, and then I stumbled upon BayBE!

I also wonder what happens in the background if I do not provide a surrogate model, as in my current tests. Is one of the available standard ones, e.g. GaussianProcessSurrogate, used by default?

As you can see, I am not an expert in Bayesian methods, and I am trying to familiarize myself with their limitations and formulate the problem at hand as correctly as possible to analyse my data.


Scienfitz commented on May 24, 2024

@mmortazavi

The utility SubspaceDiscrete.from_dataframe(configurations) will create a search space that only searches the combinations present in configurations. If your configurations equal your existing measurements, you will only ever be recommended the same configurations you already measured. Also, I see you did not remove the target, hence it is added as a parameter there too.

It seems you never inform the campaign about your measurements, hence you cannot expect any smart recommendations. Creating the SearchSpace just means informing the campaign about what parameters there are and what their possible values are, but not about actually performed measurements that you might already have.

This is done with campaign.add_measurements.
Did you follow this basic example closely?
https://emdgroup.github.io/baybe/examples/Basics/campaign.html
It contains all the steps; even if you create the search space in a different manner, the subsequent workflow is the same.

Your question about the strategy: the strategy is optional, and a default one is selected if not provided. The default one includes a Gaussian process surrogate with a sequential greedy optimizer. You can create your own strategy, for instance like this:

strategy = TwoPhaseStrategy(
    initial_recommender=INITIAL_RECOMMENDER,
    recommender=SequentialGreedyRecommender(
        surrogate_model=SURROGATE_MODEL, acquisition_function_cls=ACQ_FUNCTION
    ),
    allow_repeated_recommendations=ALLOW_REPEATED_RECOMMENDATIONS,
    allow_recommending_already_measured=ALLOW_RECOMMENDING_ALREADY_MEASURED,
)

for more info consult https://emdgroup.github.io/baybe/examples/Basics/strategies.html#creating-the-strategy-object.

The TwoPhaseStrategy by default will use a Bayesian algorithm to give you your recommendation, but ONLY if data is available. If no data was added via add_measurements yet and you haven't selected any other algorithm, you will get a random recommendation, which might just be what you observed.


AdrianSosic commented on May 24, 2024

Hi @mmortazavi. Since there was no more response from your end, I assume that all questions could be clarified? Therefore, I will close this issue now. However, please feel free to reopen it any time in case you'd like to continue the discussion. Best, Adrian 🤙🏼

