Comments (12)

AdrianSosic commented on June 7, 2024

Hi @brandon-holt, thanks for the detailed error report 🥇 There is no obvious problem that I can spot immediately, but I'm happy to investigate. Will try to reproduce the error on my machine. In case I struggle to reproduce, I'll reach out again.

Two comments upfront:

  • I guess "QC" stands for "quality control", which suggests to me that your target is really a binary variable here? I think modeling it as a NumericalTarget is fine since you only have one such target. But I still wanted to let you know that categorical targets are already on our roadmap 🎯
  • The simulation module is still one of our legacy modules for which a refactoring is overdue and already planned. So it could absolutely be the case that you are running into some form of edge case here. I think we should be able to figure it out ✌🏼

from baybe.

AdrianSosic commented on June 7, 2024

Hi @brandon-holt, I haven't been able to reproduce the error so far. I created a random data matrix like the one you showed and the simulation keeps running without problems... Can you somehow provide me with an exact reproducible example?

brandon-holt commented on June 7, 2024

@AdrianSosic Yes absolutely, will do! Just will take some time because I am OOO today and will also need to take some time to recreate with private data stripped. Thanks for all your help!

AdrianSosic commented on June 7, 2024

Hi @brandon-holt, perfect, thanks for sharing 👍🏼 I just had a look and quickly found the issue. It's a (rather common) case that we somehow forgot to consider in the simulation code: your lookup data contains "semi-duplicates", i.e. rows that contain identical parameter configurations but different values for the target variable. "Pure" duplicates wouldn't have been a problem because the merge automatically takes care of that. However, if the targets differ, the question arises what should actually happen in the simulation.
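
To make the distinction concrete, here is a minimal pandas sketch (not BayBE code) that flags semi-duplicates in a lookup table. The column names `Solvent`, `Temp`, `QC` and the helper `find_semi_duplicates` are made up for illustration.

```python
import pandas as pd


def find_semi_duplicates(
    lookup: pd.DataFrame, parameter_cols: list[str], target_cols: list[str]
) -> pd.DataFrame:
    """Hypothetical helper: rows sharing parameters but differing in targets."""
    # Pure duplicates agree on every column, so collapse them first.
    unique_rows = lookup.drop_duplicates(subset=parameter_cols + target_cols)
    # Rows that still share a parameter configuration must differ in a target.
    mask = unique_rows.duplicated(subset=parameter_cols, keep=False)
    return unique_rows[mask]


lookup = pd.DataFrame(
    {
        "Solvent": ["A", "A", "B", "B", "C"],
        "Temp": [10, 10, 20, 20, 30],
        "QC": [1.0, 0.0, 1.0, 1.0, 0.0],
    }
)
semi = find_semi_duplicates(lookup, ["Solvent", "Temp"], ["QC"])
print(semi)
```

In this toy table, the two rows for `A`/`10` are semi-duplicates, while the two rows for `B`/`20` are pure duplicates and are not flagged.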

My first thought is that we need to specify an additional resolution mechanism, i.e. similar to the impute_mode for missing lookup values, we could have a duplicate_mode that dictates how to handle duplicate entries. And choices could be something like:

  • random, where in each iteration a random row will be selected (mimicking what would happen in a real-world run)
  • mean, where the duplicate target values get averaged before the simulation
  • callable, where the user provides a custom resolution strategy in the form of a callable
  • ...

I'll discuss this with the dev team but would also like to hear your opinion on it 🙃
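
As a rough sketch of what such a `duplicate_mode` could look like outside of BayBE: the helper `resolve_duplicates`, its signature, and the column names below are assumptions for illustration only, not an existing API. Note that the `random` branch draws once per call, so it would have to be called in every iteration to mimic the behavior described above.

```python
from typing import Callable, Optional, Union

import numpy as np
import pandas as pd


def resolve_duplicates(
    lookup: pd.DataFrame,
    parameter_cols: list[str],
    target_cols: list[str],
    duplicate_mode: Union[str, Callable[[pd.Series], float]] = "mean",
    rng: Optional[np.random.Generator] = None,
) -> pd.DataFrame:
    """Hypothetical helper: collapse rows sharing a parameter configuration."""
    groups = lookup.groupby(parameter_cols, sort=False)
    if callable(duplicate_mode):
        # The callable maps the target values of one group to a single value.
        return groups[target_cols].agg(duplicate_mode).reset_index()
    if duplicate_mode == "mean":
        return groups[target_cols].mean().reset_index()
    if duplicate_mode == "random":
        rng = rng if rng is not None else np.random.default_rng()
        # groups.indices maps each configuration to its row positions.
        picks = sorted(int(rng.choice(pos)) for pos in groups.indices.values())
        picked = lookup.iloc[picks][parameter_cols + target_cols]
        return picked.reset_index(drop=True)
    raise ValueError(f"Unknown duplicate_mode: {duplicate_mode!r}")


lookup = pd.DataFrame(
    {"Solvent": ["A", "A", "B"], "Temp": [10, 10, 20], "QC": [1.0, 0.0, 1.0]}
)
params, targets = ["Solvent", "Temp"], ["QC"]
resolved_mean = resolve_duplicates(lookup, params, targets, "mean")
resolved_random = resolve_duplicates(lookup, params, targets, "random")
resolved_max = resolve_duplicates(lookup, params, targets, lambda s: s.max())
```

Each call returns one row per parameter configuration, so a subsequent merge with the search space can no longer change the number of rows.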

AdrianSosic commented on June 7, 2024

Always there to assist 🖖🏼 And glad I could help with the invariance constraints!

Thanks for the mode suggestions, will keep it in mind for the refactoring 👍🏼

Scienfitz commented on June 7, 2024

hey, no idea what could have gone wrong if you don't also include the lines showing how you called the function and prepared the campaign. Can you post these?

brandon-holt commented on June 7, 2024

@Scienfitz Totally, my apologies!

brandon-holt commented on June 7, 2024

@AdrianSosic Here you go, here's a kit to reproduce the error I'm seeing! If you run sim_bug_repro.py, you should get IndexError: boolean index did not match indexed array along dimension 0; dimension is 1428840 but corresponding boolean dimension is 1428879

I am running Python 3.11.8. Please let me know if you need any other info to reproduce the error, or have questions about the code!

Thanks :)

Scienfitz commented on June 7, 2024

@AdrianSosic the lookup of the actual values already incorporates that possibility, see:

            if len(ind) > 1:
                # More than one instance of this parameter combination
                # has been measured
                _logger.warning(
                    "The lookup rows with indexes %s seem to be "
                    "duplicates regarding parameter values. Choosing a "
                    "random one.",
                    ind,
                )
                match_vals = lookup.loc[np.random.choice(ind), target_names].values

in simulation.py

But it seems that the pre-filtering performed when the ignore impute mode is chosen does not account for that, so semi-duplicates change the index length.

Since the merge will result in duplicate rows, we could just drop duplicates; no other change would be needed:

        missing_inds = searchspace.index[
            searchspace.merge(lookup, how="left", indicator=True)
            .drop_duplicates()
            .set_index(searchspace.index)["_merge"]
            == "left_only"
        ]
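
For readers who want to see the length mismatch in isolation, here is a toy reproduction with made-up column names. It is only a sketch and not necessarily the fix that ended up in BayBE. One caveat it illustrates: if the merged frame still contains the target column, a plain `drop_duplicates()` keeps semi-duplicates because their target values differ, so this sketch deduplicates the lookup on the parameter columns before merging instead.

```python
import pandas as pd

searchspace = pd.DataFrame({"Solvent": ["A", "B", "C"], "Temp": [10, 20, 30]})
lookup = pd.DataFrame(
    {"Solvent": ["A", "A", "B"], "Temp": [10, 10, 20], "QC": [1.0, 0.0, 1.0]}
)

# The left merge emits one row per matching lookup row, so the
# semi-duplicated configuration A/10 shows up twice.
merged = searchspace.merge(lookup, how="left", indicator=True)
n_search, n_merged = len(searchspace), len(merged)  # 3 vs. 4

# A boolean mask built from `merged` is longer than `searchspace.index`,
# which is what makes the boolean indexing fail.
n_after_plain_dedup = len(merged.drop_duplicates())  # still 4: QC differs

# Keep only the unique parameter configurations of the lookup.
configurations = lookup[list(searchspace.columns)].drop_duplicates()
merged_ok = searchspace.merge(configurations, how="left", indicator=True)

# With unique keys on the right, the left merge keeps the row count and
# order of the search space, so a positional mask is safe.
mask = (merged_ok["_merge"] == "left_only").to_numpy()
missing_inds = searchspace.index[mask]
print(list(missing_inds))
```

Here only configuration `C`/`30` has no match in the lookup, so it is the only entry reported as missing.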

AdrianSosic commented on June 7, 2024

Hi @brandon-holt. So the error is now fixed on the main branch. If you run a

pip install git+https://github.com/emdgroup/baybe.git@main

the simulation should work 👍🏼 An upcoming release is also on the horizon, probably next week.

That said, I only implemented the hotfix for now but not the duplicate_mode option because – as @Scienfitz correctly pointed out – the most basic case (random) is already happening automatically in the code. In fact, you'll see some warnings when you run the code that inform you about your duplicates.

Nonetheless, the bigger simulation refactoring is still pending and we can extend the logic there. Happy to hear your thoughts/wishes!

AdrianSosic commented on June 7, 2024

@brandon-holt, perhaps one other unrelated comment:

I noticed that you are modeling some form of mixture and you are using the "slot-based" representation to describe your search space (i.e. field containing the substance + field containing the abundance) but as far as I can see you are not dealing with the issue of permutation invariance.

For the simulation example in ignore mode it doesn't really matter, since permuted search space elements will automatically be dropped (due to having no match in the lookup), but for an actual DOE loop this would cause a problem since the model would not be aware of the invariance. I suggest having a look here ;)
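
To illustrate what permutation invariance means for a slot-based representation, here is a small sketch in plain pandas. It is not the BayBE constraint mechanism; the column names and the helper `canonical_key` are hypothetical.

```python
import pandas as pd

# Two slots, each with a substance field and an abundance field.
mixtures = pd.DataFrame(
    {
        "Slot1_Substance": ["Water", "Ethanol", "Water"],
        "Slot1_Fraction": [0.7, 0.3, 0.5],
        "Slot2_Substance": ["Ethanol", "Water", "Ethanol"],
        "Slot2_Fraction": [0.3, 0.7, 0.5],
    }
)
slots = [
    ("Slot1_Substance", "Slot1_Fraction"),
    ("Slot2_Substance", "Slot2_Fraction"),
]


def canonical_key(record: dict, slots: list[tuple[str, str]]) -> tuple:
    """Hypothetical helper: an order-independent description of a mixture."""
    # Sorting the (substance, fraction) pairs removes the slot order.
    return tuple(sorted((record[sub], record[frac]) for sub, frac in slots))


# Group row labels by their canonical key.
equivalent_rows: dict[tuple, list[int]] = {}
for label, record in zip(mixtures.index, mixtures.to_dict("records")):
    equivalent_rows.setdefault(canonical_key(record, slots), []).append(label)

print(list(equivalent_rows.values()))
```

Rows 0 and 1 describe the same physical mixture with the slots swapped, which is exactly the kind of equivalence a model cannot know about unless it is told.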

brandon-holt commented on June 7, 2024

@AdrianSosic @Scienfitz Thank you both this is awesome!!

Regarding the DiscretePermutationInvarianceConstraint, thank you so much for pointing me towards that, and indirectly towards the examples on using mixtures. This is exactly what I need to be using :)

Regarding the options for "semi-duplicate" resolution, I think random and mean probably cover 90+% of cases and callable takes care of the rest.

If I had to think of other options I might suggest "mode" for discrete targets which picks the output that occurs most frequently.

Another could be "recent" which picks the most recent (by row index) since data like this is usually appended over time, and some users might just want to use the most recent data points as the ones they trust more for empirical reasons.
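
Roughly what I have in mind for these two, sketched in pandas with made-up column names and hypothetical helper names (ties in "mode" go to the smallest value here, which is just how `Series.mode` orders its result):

```python
import pandas as pd


def resolve_by_mode(
    lookup: pd.DataFrame, parameter_cols: list[str], target_col: str
) -> pd.DataFrame:
    """Hypothetical helper: keep the most frequent target value per group."""
    return (
        lookup.groupby(parameter_cols, sort=False)[target_col]
        # Series.mode returns the modes in sorted order; take the first one.
        .agg(lambda s: s.mode().iloc[0])
        .reset_index()
    )


def resolve_by_recency(
    lookup: pd.DataFrame, parameter_cols: list[str]
) -> pd.DataFrame:
    """Hypothetical helper: keep the last appended row per configuration."""
    deduplicated = lookup.drop_duplicates(subset=parameter_cols, keep="last")
    return deduplicated.reset_index(drop=True)


lookup = pd.DataFrame(
    {
        "Solvent": ["A", "A", "A", "B"],
        "Temp": [10, 10, 10, 20],
        "QC": [1, 1, 0, 0],
    }
)
by_mode = resolve_by_mode(lookup, ["Solvent", "Temp"], "QC")
by_recency = resolve_by_recency(lookup, ["Solvent", "Temp"])
```

For configuration `A`/`10`, "mode" keeps 1 (measured twice) while "recent" keeps 0 (the last row appended).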

Again, thanks so much for being so on top of things, you all have built something really awesome!!!
