
Comments (7)

tybug commented on July 18, 2024

if you had 1000 possible values, it'd take 1000 samples to enumerate all of them but ~7000 in expectation if you were generating randomly

This makes sense! Viewed through this lens, Hypothesis only sort of enumerates. When generating a new value (DataTree.generate_novel_prefix internally), we try rejection sampling first: given a draw_integer(0, 1000) ir node, we randomly and uniformly generate a value in [0, 1000], resampling as necessary. If we ran this to completion, our EV is what you stated. But what we actually do is rejection sample 10 times, and if all 10 draws fail, exhaustively enumerate (to balance memory and randomness, we enumerate only the first 100 children and then sample from those). The intuition here is to try hard to satisfy the natural requested distribution, but fall back to enumeration if we're at the tail end and things get too difficult. I dunno what the EV of this scheme is, but it feels around 2n? Not great, but not as bad as it could be, and it avoids the tail-end pitfall. 10 retries was a somewhat arbitrary choice here, fwiw.
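
For a rough picture of that flow, here's a minimal sketch (not Hypothesis's actual implementation; the function name is made up, and the 10/100 constants are just the ones described above):

import random

def generate_novel_value(lo, hi, seen):
    # Phase 1: rejection sampling. Take up to 10 uniform draws,
    # hoping to hit a value we haven't generated before.
    for _ in range(10):
        value = random.randint(lo, hi)
        if value not in seen:
            return value
    # Phase 2: fall back to enumeration. To balance memory and
    # randomness, collect only the first 100 unseen children and
    # sample uniformly from those.
    candidates = [v for v in range(lo, hi + 1) if v not in seen]
    if not candidates:
        return None  # this subtree is exhausted
    return random.choice(candidates[:100])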

We could potentially take max_examples into account when deciding when to enumerate...but I would be cautious of doing so. We spend our max_examples budget on things other than generating novel elements, like optimizing target(), replaying from the database, or mutating old inputs. Maybe you would argue we shouldn't be doing any of those things if we could exhaustively enumerate instead?

Based on what you're both saying, it seems like the timezone_keys issue is an unexpected one, whereas lists is expected, if a little unfortunate. Does that seem right? Should I keep submitting reports like these as they come up?

That would be my analysis 👍. And please do keep submitting reports of strategy inefficiencies, whether in discards/overruns or in duplicates! Some of them would probably have been improved in time anyway, but I suspect there are a fair number of strategies which will require manual tweaking even after the ir: either because they were always inefficient and nobody noticed, or because we intentionally chose an inefficient generation for, e.g., good shrinking behavior, a tradeoff the ir has since alleviated.

tybug commented on July 18, 2024

An update: timezone_keys duplicates are from generate_mutations_from, which still operates on sub-ir examples and bypasses datatree-based duplicate detection. This should be fixed once we migrate generate_mutations_from to the ir, which is in turn easier once we finish the shrinker migration (pr soon!) and remove sub-ir examples.

I expect lots of other strategies are similarly over-duplicated due to this, including vanilla integers.

tybug commented on July 18, 2024

Those discards are actually mostly strategy-independent: they come from our logic of capping the size of early inputs. For roughly the first 10% of inputs, if we generate something too large, we throw it away. This shows up as overruns because we stop generation early by setting a low max_length on the ConjectureData:

max_length = min(len(prefix) + minimal_extension * 10, BUFFER_SIZE)

I don't know whether this behavior is ideal, but thankfully it at least won't compound in the way you were afraid of 😅. Some reasoning behind the change here: #2219.
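
Plugging illustrative numbers into that cap (BUFFER_SIZE is assumed here to be the 8 KiB default; the minimal_extension value is hypothetical):

BUFFER_SIZE = 8 * 1024
prefix = b""            # assume no required novel prefix for this input
minimal_extension = 8   # assumed smallest useful extension, in bytes
max_length = min(len(prefix) + minimal_extension * 10, BUFFER_SIZE)
print(max_length)       # 80: longer early inputs are stopped as overruns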

we should theoretically be able to switch to enumeration 100% of the time.

We actually already try to enumerate all the time! More specifically, we track what inputs we previously generated, and avoid generating them again. If we exhaust this choice tree (called DataTree internally), we terminate early.

This deduplication used to happen at the bitstream level, but now happens at the ir level, which is one of the big selling points of the ir (#3921). For example, [True, 2, True, 5, False] represents the list [2, 5] from st.lists(st.integers()), where the boolean draws are the list strategy asking whether it should generate another element.
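
As a toy sketch of that draw pattern (the Data class below is a stand-in, not Hypothesis's ConjectureData, and the real list strategy also biases the continue probability toward the requested average size):

import random

class Data:
    # Toy stand-in for the ir draw interface; records every draw.
    def __init__(self):
        self.ir = []

    def draw_boolean(self):
        b = random.random() < 0.5
        self.ir.append(b)
        return b

    def draw_integer(self, lo=0, hi=100):
        n = random.randint(lo, hi)
        self.ir.append(n)
        return n

def draw_list(data):
    # Each element is preceded by a boolean "one more?" draw, so
    # drawing [2, 5] records the ir sequence [True, 2, True, 5, False].
    result = []
    while data.draw_boolean():
        result.append(data.draw_integer())
    return result

data = Data()
print(draw_list(data), data.ir)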

This deduplication is better than it used to be: while neither bitstream ↦ input nor ir ↦ input is injective, the latter is much closer to injective than the former. But we're still in the midst of the ir migration and haven't seen the full benefits yet. My intuition is that timezone_keys, which is basically sampled_from internally, should be perfectly tracked by the ir...but I haven't looked at the details of sampled_from to know for sure.

It's great to have an example of a strategy, timezone_keys(allow_prefix=False), which should have good deduplication, but doesn't — thanks!

Here's a neat illustration of our duplication tracking:

from collections import Counter
from hypothesis import given, settings, strategies as st

seen = []
@given(st.integers(0, 50))
@settings(database=None)
def f(n):
    seen.append(n)
f()
print(len(set(seen)))
print(Counter(seen))
51
Counter({49: 3, 10: 3, 26: 3, 44: 2, 29: 2, 41: 2, 16: 2, 15: 2, 28: 2, 24: 2, 0: 1, 5: 1, 43: 1, 45: 1, 27: 1, 47: 1, 39: 1, 3: 1, 2: 1, 9: 1, 8: 1, 17: 1, 4: 1, 42: 1, 13: 1, 35: 1, 23: 1, 32: 1, 31: 1, 18: 1, 1: 1, 50: 1, 38: 1, 33: 1, 21: 1, 7: 1, 19: 1, 22: 1, 12: 1, 40: 1, 20: 1, 25: 1, 48: 1, 37: 1, 11: 1, 6: 1, 14: 1, 30: 1, 46: 1, 36: 1, 34: 1})

My guess is 49 is duplicated because we upweight near-endpoints on integers, outside of the ir layer (something I'm looking to improve!). I actually have no clue why other numbers like 10 or 26 are overrepresented, though. Something to look into...

Zac-HD commented on July 18, 2024

timezone_keys(allow_prefix=False) is literally just a sampled_from(), which should be bijective with the IR via draw_integer() if there are no filters on it. Weird 🤔
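
In sketch form, that bijection claim looks like this (illustrative only, with random.randint standing in for the ir-level draw_integer):

import random

def draw_sampled_from(values):
    # The only ir draw is a single integer index, and index <-> value
    # is a bijection for an unfiltered sampled_from, so tracking seen
    # ir sequences should track seen values exactly.
    i = random.randint(0, len(values) - 1)  # stands in for draw_integer
    return values[i]

print(draw_sampled_from(["UTC", "America/New_York", "Europe/London"]))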

More generally, thanks for the report, and thanks Liam for covering what I'd say! I suspect that a bunch of this will go away naturally by the time we finish migrating to the IR, but we should definitely track such issues to make sure we fix them.

hgoldstein95 commented on July 18, 2024

Thanks for all of that context, both of you!

One thing I want to clarify: when I say "enumeration" I mean something slightly different from "remembering what we've generated and not generating that again". If you know ahead of time that you'll want every possible value from a strategy, you should be able to (modulo some implementation details) just list them off, one by one, without making random choices at all. This is much more efficient and effective than generating randomly and remembering previously generated values: if you had 1000 possible values, it'd take 1000 samples to enumerate all of them but ~7000 in expectation if you were generating randomly (assuming a uniform distribution, which Hypothesis doesn't provide). Does that make sense? Or did I misunderstand what you were getting at?
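
For reference, the ~7000 figure is the coupon-collector expectation, n * H_n, where H_n is the n-th harmonic number:

# Expected number of uniform samples needed to see all n values at
# least once (coupon collector), versus exactly n for enumeration.
n = 1000
harmonic = sum(1 / k for k in range(1, n + 1))
print(round(n * harmonic))  # 7485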

Based on what you're both saying, it seems like the timezone_keys issue is an unexpected one, whereas lists is expected, if a little unfortunate. Does that seem right? Should I keep submitting reports like these as they come up?

hgoldstein95 commented on July 18, 2024

Yeah, I think I would argue that if we can conservatively conclude that exhaustive enumeration would require fewer inputs than our limit, we should just fall back to that — I don't see a downside as long as that estimate is accurate.

from hypothesis.

tybug commented on July 18, 2024

With #4007, we now deduplicate timezone_keys as expected.

The only remaining planned win for deduplication is st.integers(min_value=n, max_value=m) with m - n > 127, where we currently draw two integers internally but ideally draw only one. Otherwise, the latest release should have all the deduplication benefits we expect to get from the IR. If you discover otherwise, we'd love to know – and thanks for the report here!
