Comments (12)
Hi @brandon-holt, thanks for the detailed error report 🥇 There is no obvious problem that I can spot immediately, but I'm happy to investigate. Will try to reproduce the error on my machine. In case I struggle to reproduce, I'll reach out again.
Two comments upfront:
- I guess "QC" stands for "quality control", which suggests to me that your target is really a binary variable here? I think modeling it as a NumericalTarget is fine since you only have one such target. But I still wanted to let you know that categorical targets are already on our roadmap 🎯
- The simulation module is still one of our legacy modules for which a refactoring is overdue and already planned. So it could absolutely be the case that you are running into some form of edge case here. I think we should be able to figure it out ✌🏼
from baybe.
Hi @brandon-holt, I haven't been able to reproduce the error so far. I created a random data matrix like the one you showed and the simulation keeps running without problems... Can you somehow provide me with an exact reproducible example?
from baybe.
@AdrianSosic Yes absolutely, will do! Just will take some time because I am OOO today and will also need to take some time to recreate with private data stripped. Thanks for all your help!
from baybe.
Hi @brandon-holt, perfect, thanks for sharing 👍🏼 I just had a look and found the issue quickly, it's really a (rather common) case that we somehow forgot to consider in the simulation code: your lookup data contains "semi-duplicates", i.e. rows that contain identical parameter configurations but with different values for the target variable. "Pure" duplicates wouldn't have been a problem because the merge automatically takes care of that. However, if the targets differ, the question arises what should actually happen in the simulation.
My first thought is that we need to specify an additional resolution mechanism, i.e. similar to the impute_mode for missing lookup values, we could have a duplicate_mode that dictates how to handle duplicate entries. And choices could be something like:
- random, where in each iteration a random row will be selected (mimicking what would happen in a real-world run)
- mean, where the duplicate target values get averaged before the simulation
- callable, where the user provides a custom resolution strategy in form of a callable
- ...
I'll discuss this with the dev team but would also like to hear your opinion on it 🙃
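To make the proposal concrete, here is a minimal sketch of what such a resolution step could look like with plain pandas. Everything in it is an assumption for illustration: resolve_duplicates, its parameters, and the toy data are hypothetical and not part of baybe. For simplicity, the random branch samples one row per configuration up front rather than in each iteration.

```python
import numpy as np
import pandas as pd


def resolve_duplicates(lookup, parameter_names, target_names,
                       duplicate_mode="random", rng=None):
    """Hypothetical helper (not a baybe API): collapse rows that share
    the same parameter configuration but differ in their targets."""
    grouped = lookup.groupby(parameter_names, sort=False)
    if duplicate_mode == "mean":
        # Average the target values within each parameter configuration.
        return grouped[target_names].mean().reset_index()
    if duplicate_mode == "random":
        rng = np.random.default_rng() if rng is None else rng
        # grouped.indices maps each group key to the row positions in it.
        picked = [rng.choice(pos) for pos in grouped.indices.values()]
        return lookup.iloc[sorted(picked)].reset_index(drop=True)
    if callable(duplicate_mode):
        # User-supplied strategy, applied per group to each target column.
        return grouped[target_names].agg(duplicate_mode).reset_index()
    raise ValueError(f"Unknown duplicate_mode: {duplicate_mode!r}")


# Toy lookup: the first two rows are "semi-duplicates".
lookup = pd.DataFrame(
    {"x": [1, 1, 2], "y": ["a", "a", "b"], "target": [0.0, 1.0, 0.5]}
)
averaged = resolve_duplicates(lookup, ["x", "y"], ["target"], "mean")
sampled = resolve_duplicates(
    lookup, ["x", "y"], ["target"], "random", rng=np.random.default_rng(0)
)
largest = resolve_duplicates(lookup, ["x", "y"], ["target"], max)
```

In each case the result contains exactly one row per parameter configuration, so a later merge against the search space can no longer change the row count.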
from baybe.
Always there to assist 🖖🏼 And glad I could help with the invariance constraints!
Thanks for the mode suggestions, will keep it in mind for the refactoring 👍🏼
from baybe.
Hey, no idea what could have gone wrong unless you also include the lines showing how you called the function and prepared the campaign. Can you post this?
from baybe.
@Scienfitz Totally, my apologies!
from baybe.
@AdrianSosic Here you go, here's a kit to reproduce the error I'm seeing! If you run sim_bug_repro.py, you should get IndexError: boolean index did not match indexed array along dimension 0; dimension is 1428840 but corresponding boolean dimension is 1428879
I am running Python 3.11.8. Please let me know if you need any other info to reproduce the error, or have questions about the code!
Thanks :)
from baybe.
@AdrianSosic the lookup of the actual values already incorporates that possibility, see:
    if len(ind) > 1:
        # More than two instances of this parameter combination
        # have been measured
        _logger.warning(
            "The lookup rows with indexes %s seem to be "
            "duplicates regarding parameter values. Choosing a "
            "random one.",
            ind,
        )
        match_vals = lookup.loc[np.random.choice(ind), target_names].values
in simulation.py.
But it seems the pre-filtering that runs when the ignore impute option is chosen does not account for that, so semi-duplicates change the index length.
Since the merge will result in duplicate rows, we could just drop duplicates; no other change would be needed:
    missing_inds = searchspace.index[
        searchspace.merge(lookup, how="left", indicator=True)
        .drop_duplicates()
        .set_index(searchspace.index)["_merge"]
        == "left_only"
    ]
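The length mismatch behind the reported IndexError can be reproduced on toy data. The sketch below is an illustration under assumptions, not the fix that landed on main: the two frames are made-up stand-ins for the search space and the lookup table, and the workaround shown deduplicates the lookup on the parameter columns only before merging.

```python
import pandas as pd

# Hypothetical stand-ins for the search space and the lookup table.
searchspace = pd.DataFrame({"x": [1, 2, 3]})
lookup = pd.DataFrame(
    {"x": [1, 1, 2], "target": [0.1, 0.9, 0.5]}  # x == 1 is a semi-duplicate
)

# A left merge emits one row per match, so the semi-duplicate makes the
# merged frame longer than the search space it is supposed to index.
merged = searchspace.merge(lookup, how="left", indicator=True)

# Merging against the unique parameter configurations keeps the lengths
# aligned, because the differing target values no longer take part.
params = list(searchspace.columns)
merged_unique = searchspace.merge(
    lookup[params].drop_duplicates(), how="left", indicator=True
)
mask = (merged_unique["_merge"] == "left_only").to_numpy()
missing_inds = searchspace.index[mask]
```

Here merged has four rows for a three-row search space, which is the kind of mismatch the boolean indexing trips over, while merged_unique stays at three rows and flags only the configuration x == 3 as missing from the lookup.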
from baybe.
Hi @brandon-holt. So the error is now fixed on the main branch. If you run

    pip install git+https://github.com/emdgroup/baybe.git@main

the simulation should work 👍🏼 An upcoming release is also on the horizon, probably next week.
That said, I only implemented the hotfix for now but not the duplicate_mode option because, as @Scienfitz correctly pointed out, the most basic case (random) is already happening automatically in the code. In fact, you'll see some warnings when you run the code that inform you about your duplicates.
Nonetheless, the bigger simulation refactoring is still pending and we can extend the logic there. Happy to hear your thoughts/wishes!
from baybe.
@brandon-holt, perhaps one other unrelated comment:
I noticed that you are modeling some form of mixture and you are using the "slot-based" representation to describe your search space (i.e. field containing the substance + field containing the abundance) but as far as I can see you are not dealing with the issue of permutation invariance.
For the simulation example in ignore mode it doesn't really matter since permuted search space elements will automatically be dropped (due to having no match in the lookup), but for an actual DOE loop this would cause a problem since the model would not be aware of the invariance. I suggest to have a look here ;)
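To illustrate what permutation invariance means for a slot-based representation, here is a small library-independent sketch. The substances, amounts, and two-slot layout are made up for the example, and it does not use any baybe API.

```python
from itertools import product

# Hypothetical two-slot mixture: each slot holds (substance, amount).
substances = ["A", "B"]
amounts = [10, 20]

rows = [
    ((s1, a1), (s2, a2))
    for s1, a1, s2, a2 in product(substances, amounts, substances, amounts)
    if s1 != s2  # a substance may occupy only one slot
]

# Swapping the slots describes the same physical mixture, so sorting the
# slots yields one canonical representative per mixture.
canonical = {tuple(sorted(row)) for row in rows}
```

The slot-based enumeration yields 8 rows, but only 4 distinct mixtures remain after canonicalization: (("A", 10), ("B", 20)) and (("B", 20), ("A", 10)) are the same mixture written in two different slot orders.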
from baybe.
@AdrianSosic @Scienfitz Thank you both this is awesome!!
Regarding the DiscretePermutationInvarianceConstraint, thank you so much for pointing me towards that, and indirectly towards the examples on using mixtures. This is exactly what I need to be using :)
Regarding the options for "semi-duplicate" resolution, I think random and mean probably cover 90+% of cases and callable takes care of the rest.
If I had to think of other options I might suggest "mode" for discrete targets which picks the output that occurs most frequently.
Another could be "recent" which picks the most recent (by row index) since data like this is usually appended over time, and some users might just want to use the most recent data points as the ones they trust more for empirical reasons.
Again, thanks so much for being so on top of things, you all have built something really awesome!!!
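For reference, the "mode" and "recent" ideas above could be sketched with pandas group-by aggregations. This is only an illustration on made-up data with hypothetical column names, not an existing baybe option. Note that ties in "mode" are resolved here by taking the smallest of the most frequent values.

```python
import pandas as pd

# Toy lookup where the configuration x == 1 was measured three times.
lookup = pd.DataFrame({"x": [1, 1, 1, 2], "qc": [1, 1, 0, 0]})

grouped = lookup.groupby("x", sort=False)["qc"]

# "mode": the most frequent target value per configuration.
most_frequent = grouped.agg(lambda s: s.mode().iloc[0])

# "recent": the last row per configuration, assuming rows are appended
# over time so that a higher row index means a newer measurement.
most_recent = grouped.last()
```

For x == 1, "mode" yields 1 (two of the three measurements) while "recent" yields 0 (the last appended row), so the two strategies can disagree on the same data.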
from baybe.