Comments (7)
> if you had 1000 possible values, it'd take 1000 samples to enumerate all of them but ~7000 in expectation if you were generating randomly
This makes sense! Viewed through this lens, Hypothesis only sort of enumerates. When generating a new value (`DataTree.generate_novel_prefix` internally), we try rejection sampling first: given a `draw_integer(0, 1000)` ir node, we randomly and uniformly generate a value in [0, 1000], resampling as necessary. If we ran this to completion, our EV is what you stated. But what we actually do is rejection sample 10 times, and if all 10 fail, exhaustively enumerate (to balance memory and randomness, we enumerate only the first 100 children and then sample from those). The intuition here is to try hard to satisfy the natural requested distribution, but fall back to enumeration if we're at the tail end and things get too difficult. I dunno what the EV of this is, but it feels around 2n? Not great, but not as bad as it could be, and it avoids the tail-end pitfall. 10 retries was a somewhat arbitrary choice here, fwiw.
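The described two-phase approach can be sketched like so (a toy model under my own assumptions, not Hypothesis's actual implementation; all names here are made up for illustration):

```python
import random

def generate_novel_value(lo, hi, seen, retries=10, enum_cap=100):
    """Toy sketch: rejection-sample a few times to honor the requested
    uniform distribution, then fall back to (partial) enumeration."""
    # Phase 1: uniform rejection sampling against previously seen values.
    for _ in range(retries):
        v = random.randint(lo, hi)
        if v not in seen:
            return v
    # Phase 2: enumerate, but only materialize the first `enum_cap`
    # unseen children (balancing memory against randomness), then
    # sample from those.
    unseen = []
    for v in range(lo, hi + 1):
        if v not in seen:
            unseen.append(v)
            if len(unseen) == enum_cap:
                break
    return random.choice(unseen) if unseen else None
```

The point of phase 2 is that when only a handful of values remain novel, rejection sampling almost always misses, while enumeration finds a novel value in one pass.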
We could potentially take `max_examples` into account when deciding when to enumerate... but I would be cautious of doing so. We use our max-example budget for things other than generating novel elements, like optimizing `target()`, replaying from the databases, or mutating old inputs. Maybe you would argue we shouldn't be doing any of those things if we could exhaustively enumerate instead?
> Based on what you're both saying, it seems like the `timezone_keys` issue is an unexpected one, whereas `lists` is expected, if a little unfortunate. Does that seem right? Should I keep submitting reports like these as they come up?
That would be my analysis 👍. And please do keep submitting reports of strategy inefficiencies, whether in discarding/overruns or in duplicates! Some of them probably would have been improved in time anyway, but I suspect there are a fair number of strategies which will require manual tweaking even after the ir: either because they were always inefficient and nobody noticed, or because we intentionally chose an inefficient generation for e.g. good shrinking behavior, a tradeoff which has since been alleviated by the ir.
An update: `timezone_keys` duplicates are from `generate_mutations_from`, which still operates on sub-ir examples and bypasses datatree-based duplicate detection. This should be fixed once we migrate `generate_mutations_from` to the ir, which is in turn easier once we finish the shrinker migration (pr soon!) and remove sub-ir examples.

I expect lots of other strategies are similarly over-duplicated due to this, including vanilla `integers`.
Those discards are actually mostly strategy-independent, and come from our logic of capping the size of early inputs. In our first roughly 10% of inputs, if we generate something too large, we'll throw it away. This shows up as overruns because we stop the input early by setting a low `max_length` on the `ConjectureData`.

I don't know whether this behavior is ideal, but thankfully it at least won't compound in the way you were afraid of 😅. Some reasoning behind the change here: #2219.
> we should theoretically be able to switch to enumeration 100% of the time.
We actually already try to enumerate all the time! More specifically, we track what inputs we previously generated, and avoid generating them again. If we exhaust this choice tree (called `DataTree` internally), we terminate early.

This deduplication used to happen at the bitstream level, but now happens at the ir level — one of the big selling points of the ir (#3921). ex: `[True, 2, True, 5, False]` represents the list `[2, 5]` from `st.lists(st.integers())`, where the boolean draws are the list asking whether it should generate another element.
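The example ir sequence above can be decoded mechanically (a toy illustration of the encoding just described, not Hypothesis internals):

```python
def decode_list_ir(nodes):
    """Decode an alternating sequence of boolean "draw another
    element?" nodes and integer element nodes into a list."""
    it = iter(nodes)
    out = []
    while next(it):           # boolean draw: generate another element?
        out.append(next(it))  # integer draw: the element itself
    return out

decode_list_ir([True, 2, True, 5, False])  # → [2, 5]
```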
This deduplication is better than it used to be because, while neither bitstream ↦ input nor ir ↦ input is injective, the latter is much closer to injective than the former. But we're still in the midst of the ir migration and haven't seen the full benefits yet. My intuition is `timezone_keys` — which is basically `sampled_from` internally — should be perfectly tracked by the ir... but I haven't looked at the details of `sampled_from` to know for sure.
It's great to have an example of a strategy, `timezone_keys(allow_prefix=False)`, which should have good deduplication but doesn't — thanks!
Here's a neat illustration of our duplication tracking:

```python
from collections import Counter

from hypothesis import given, settings, strategies as st

seen = []

@given(st.integers(0, 50))
@settings(database=None)
def f(n):
    seen.append(n)

f()
print(len(set(seen)))
print(Counter(seen))
```

Output:

```
51
Counter({49: 3, 10: 3, 26: 3, 44: 2, 29: 2, 41: 2, 16: 2, 15: 2, 28: 2, 24: 2, 0: 1, 5: 1, 43: 1, 45: 1, 27: 1, 47: 1, 39: 1, 3: 1, 2: 1, 9: 1, 8: 1, 17: 1, 4: 1, 42: 1, 13: 1, 35: 1, 23: 1, 32: 1, 31: 1, 18: 1, 1: 1, 50: 1, 38: 1, 33: 1, 21: 1, 7: 1, 19: 1, 22: 1, 12: 1, 40: 1, 20: 1, 25: 1, 48: 1, 37: 1, 11: 1, 6: 1, 14: 1, 30: 1, 46: 1, 36: 1, 34: 1})
```
My guess is `49` is duplicated because we upweight near-endpoints on integers, outside of the ir layer (something I'm looking to improve!). I actually have no clue why other numbers like `10` or `26` are overrepresented, though. Something to look into...
`timezone_keys(allow_prefix=False)` is literally just a `sampled_from()`, which should be bijective with the IR via `draw_integer()` if there are no filters on it. Weird 🤔
More generally, thanks for the report, and thanks Liam for covering what I'd say! I suspect that a bunch of this will go away naturally by the time we finish migrating to the IR, but we should definitely track such issues to make sure we fix them.
Thanks for all of that context, both of you!
One thing I want to clarify: when I say "enumeration" I mean something slightly different from "remembering what we've generated and not generating that again". If you know ahead of time that you'll want every possible value from a strategy, you should be able to (modulo some implementation details) just list them off, one by one, without making random choices at all. This is much more efficient and effective than generating randomly and remembering previously generated values: if you had 1000 possible values, it'd take 1000 samples to enumerate all of them but ~7000 in expectation if you were generating randomly (assuming a uniform distribution, which Hypothesis doesn't provide). Does that make sense? Or did I misunderstand what you were getting at?
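The "~7000 in expectation" figure is the classic coupon-collector expectation, which we can compute directly (a standalone illustration, not Hypothesis code):

```python
def expected_uniform_samples(n):
    """Coupon-collector expectation: the average number of uniform
    draws needed to see all n distinct values at least once, n * H_n
    (where H_n is the n-th harmonic number)."""
    return n * sum(1 / k for k in range(1, n + 1))

# For n = 1000 this is about 7485 draws, versus exactly 1000 for
# direct enumeration.
```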
Based on what you're both saying, it seems like the `timezone_keys` issue is an unexpected one, whereas `lists` is expected, if a little unfortunate. Does that seem right? Should I keep submitting reports like these as they come up?
Yeah, I think I would argue that if we can conservatively conclude that exhaustive enumeration would require fewer inputs than our limit, we should just fall back to that — I don't see a downside as long as that estimate is accurate.
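As a toy decision rule, the suggestion amounts to something like this (my own hypothetical sketch, not Hypothesis code; both names are made up):

```python
def prefer_enumeration(domain_upper_bound, remaining_budget):
    """Exhaustive enumeration costs at most `domain_upper_bound`
    inputs, so when a conservative (over-)estimate of the domain size
    fits within the remaining example budget, enumeration can never
    do worse than random sampling with deduplication."""
    if domain_upper_bound is None:  # domain size unknown or unbounded
        return False
    return domain_upper_bound <= remaining_budget
```

The key property is conservatism: as long as `domain_upper_bound` is a true upper bound, switching to enumeration is a pure win; an underestimate would wrongly promise exhaustion.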
With #4007, we now deduplicate `timezone_keys` as expected.

The only remaining planned win for deduplication is `st.integers(min_value=n, max_value=m)` with `m - n > 127`, where we currently draw two integers internally but ideally would draw only one. Otherwise, the latest release should have all the deduplication benefits we expect to get from the IR. If you discover otherwise, we'd love to know – and thanks for the report here!