zac-hd / hypofuzz

Adaptive fuzzing of Hypothesis tests

Home Page: https://hypofuzz.com/docs

License: GNU Affero General Public License v3.0

Languages: Python 71.23%, TeX 28.77%
Topics: fuzzing, hypothesis, testing

hypofuzz's Introduction

Adaptive fuzzing of Hypothesis tests.

Property-based approaches help you to write better tests which find more bugs, but don't have great ways to exchange much more CPU time for more bugs. The goal of this project is to bring together the best parts of fuzzing and PBT.

Motivation

You can run a traditional fuzzer like AFL on Hypothesis tests to get basic coverage guidance. This works OK, but there's a lot of performance overhead. Installing, configuring, and connecting all the parts is a pain, and because it assumes one fuzz target per core you probably can't scale up far enough to fuzz your whole test suite.

Alternatively, you can just run Hypothesis with a large max_examples setting. This also works pretty well, but doesn't get the benefits of coverage guidance and you have to guess how long it'll take to run the tests - each gets the same budget.
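For concreteness, the "just run more examples" approach is a couple of lines of Hypothesis configuration (a minimal sketch; the profile name and budget are arbitrary):

# settings.py or conftest.py - register a long-running profile and opt in to it
from hypothesis import settings

settings.register_profile("long", max_examples=100_000)
settings.load_profile("long")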

HypoFuzz solves all of these problems, and more!

Features

  • Interleave execution of many test functions
  • Prioritise functions where we expect to make progress
  • Coverage-guided exploration of your system-under-test
  • Seamless python-native and CLI integrations (replaces the pytest command)
  • Web-based time-travel debugging with PyTrace (automatic if you pip install hypofuzz[pytrace])

Read more about HypoFuzz at https://hypofuzz.com/docs/, including the changelog.

hypofuzz's People

Contributors

cheukting, tybug, zac-hd


hypofuzz's Issues

Incompatibility with hypothesis >= 6.94.0

There is a call to prep_args_kwargs_from_strategies which now appears to be wrong, at https://github.com/Zac-HD/hypofuzz/blob/master/src/hypofuzz/hy.py#L277-L280

a, kw, argslices = context.prep_args_kwargs_from_strategies(
    (), self.__stuff.given_kwargs
)
assert not a, "strategies all moved to kwargs by now"

When I switch it to the following I can get past it and seem to run the tests, though I do not know if there are other implications.

kw, argslices = context.prep_args_kwargs_from_strategies(
    self.__stuff.given_kwargs
)

hypofuzz isn't compatible with current versions of pytest

When trying to run hypofuzz, I get this error:

<rest of backtrace snipped as it doesn't say much about what's broken>
INTERNALERROR>   File "~/project/.tox/fuzz/lib/python3.11/site-packages/hypofuzz/interface.py", line 40, in pytest_collection_finish
INTERNALERROR>     all_autouse = set(manager._getautousenames(item.nodeid))
INTERNALERROR>                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR>   File "~/project/.tox/fuzz/lib/python3.11/site-packages/_pytest/fixtures.py", line 1502, in _getautousenames
INTERNALERROR>     for parentnode in node.listchain():
INTERNALERROR>                       ^^^^^^^^^^^^^^
INTERNALERROR> AttributeError: 'str' object has no attribute 'listchain'

As it turns out, _getautousenames takes an Item in current/recent versions of pytest, not a string nodeid. I experienced this with Hypofuzz 24.2.3, Hypothesis 6.100.1, and PyTest 8.1.1, although the bug still appears to be present at HEAD.

Docs: example GitHub Actions configuration with CI + a fuzzing cronjob with shared database

I suspect that a copy-pasteable example configuration would get many people over the line from "vaguely interested" to "actually fuzzing", and so this should be a pretty high priority. I also expect it's a little fiddly to get database sharing set up and working properly between different jobs, so implementing and testing that seems pretty valuable as a FUD-reducer.

See also Nelson's workflow request, bug-finding vs regression-check testing, discussion in #13 (comment).

When we've tested this out (e.g. on `shed`) we should let Pydantic know about it: pydantic/pydantic#4287 (comment)
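Until that example exists, here is a hedged sketch of the Python side of database sharing, assuming the Hypothesis-provided GitHubArtifactDatabase backend, that the CI job uploads its local database as an artifact named "hypothesis-example-db", and with placeholder owner/repo names:

# conftest.py (sketch)
from hypothesis import settings
from hypothesis.database import (
    DirectoryBasedExampleDatabase,
    GitHubArtifactDatabase,
    MultiplexedDatabase,
    ReadOnlyDatabase,
)

local = DirectoryBasedExampleDatabase(".hypothesis/examples")
shared = ReadOnlyDatabase(GitHubArtifactDatabase("my-org", "my-repo"))

settings.register_profile("ci", database=MultiplexedDatabase(local, shared))
settings.load_profile("ci")

The workflow file itself - the cron trigger, artifact upload, and job wiring - is the part this issue asks to write up and test.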

Fix incompatibility with Pytest 8.x

When I attempt to run hypofuzz on a piece of software with an extensive (mostly non-hypothesis) test suite, it fails. The error message leaves me very unclear what the problem is or what to do about it:

The software is PINT: https://github.com/nanograv/PINT

The hypothesis-based tests include tests of time conversion precision at the nanosecond level for time scales including leap seconds, over time spans of decades. There are also a variety of tests checking support for ill-defined traditional data formats. The test suite has been running successfully using hypothesis for years.

$ hypothesis fuzz -n 4
================================================================================= test session starts =================================================================================
platform linux -- Python 3.11.4, pytest-8.0.0, pluggy-1.4.0
rootdir: /home/peridot/projects/pint/PINT
configfile: pytest.ini
plugins: xdist-3.5.0, dash-2.15.0, hypothesis-6.98.3, anyio-4.2.0, cov-4.1.0
collected 2885 items / 2790 deselected / 95 selected

<Dir PINT>
  <Dir tests>
    <Module test_derived_quantities.py>
      <Function test_companion_mass>
      <Function test_companion_mass_array>
      <Function test_pulsar_mass>
      <Function test_pulsar_mass_array>
      <Function test_omdot_to_mtot>
      <Function test_a1sini_Mc>
      <Function test_a1sini_Mp>
      <Function test_a1sini_Mc_array>
      <Function test_a1sini_Mp_array>
    <Module test_parfile_writing_format.py>
      <Function test_roundtrip>
    <Module test_precision.py>
      <Function test_str_roundtrip_is_exact>
      <Function test_longdouble_str_roundtrip_is_exact>
      <Function test_longdouble2str_same_as_str_and_repr>
      <Function test_time_construction_jds_exact[tai]>
      <Function test_time_construction_jds_exact[tt]>
      <Function test_time_construction_jds_exact[tdb]>
      <Function test_time_construction_mjds_preserved>
      <Function test_time_construction_mjd_versus_jd[tai]>
      <Function test_time_construction_mjd_versus_jd[tt]>
      <Function test_time_construction_mjd_versus_jd[tdb]>
      <Function test_time_to_longdouble_via_jd[tai]>
      <Function test_time_to_longdouble_via_jd[tt]>
      <Function test_time_to_longdouble_via_jd[tdb]>
      <Function test_time_to_longdouble[tai]>
      <Function test_time_to_longdouble[tt]>
      <Function test_time_to_longdouble[tdb]>
      <Function test_time_to_longdouble_utc[mjd]>
      <Function test_time_to_longdouble_utc[pulsar_mjd]>
      <Function test_time_from_longdouble[tai]>
      <Function test_time_from_longdouble[tt]>
      <Function test_time_from_longdouble[tdb]>
      <Function test_time_from_longdouble_utc[mjd]>
      <Function test_time_from_longdouble_utc[pulsar_mjd]>
      <Function test_time_to_longdouble_close_to_time_to_mjd_string[mjd]>
      <Function test_time_to_longdouble_close_to_time_to_mjd_string[pulsar_mjd]>
      <Function test_time_to_longdouble_no_longer_than_time_to_mjd_string>
      <Function test_time_to_mjd_string_versus_longdouble[mjd]>
      <Function test_time_to_mjd_string_versus_longdouble[pulsar_mjd]>
      <Function test_time_to_mjd_string_versus_decimal[mjd]>
      <Function test_time_to_mjd_string_versus_decimal[pulsar_mjd]>
      <Function test_time_from_mjd_string_versus_longdouble_tai>
      <Function test_time_from_mjd_string_versus_longdouble_utc[mjd]>
      <Function test_time_from_mjd_string_versus_longdouble_utc[pulsar_mjd]>
      <Function test_pulsar_mjd_never_differs_too_much_from_mjd_tai>
      <Function test_pulsar_mjd_never_differs_too_much_from_mjd_utc>
      <Function test_time_from_mjd_string_accuracy_vs_longdouble[pulsar_mjd]>
      <Function test_time_from_mjd_string_accuracy_vs_longdouble[mjd]>
      <Function test_time_from_mjd_string_roundtrip[pulsar_mjd]>
      <Function test_time_from_mjd_string_roundtrip[mjd]>
      <Function test_mjd_equals_pulsar_mjd_in_tai>
      <Function test_make_pulsar_mjd_ancient>
      <Function test_make_mjd_ancient>
      <Function test_pulsar_mjd_equals_mjd_on_non_leap_second_days>
      <Function test_pulsar_mjd_equals_mjd_on_leap_second_days>
      <Function test_pulsar_mjd_close_to_mjd_on_leap_second_days>
      <Function test_pulsar_mjd_proceeds_at_normal_rate_on_leap_second_days>
      <Function test_mjd_proceeds_slower_on_leap_second_days>
      <Function test_erfa_conversion_on_leap_sec_days>
      <Function test_erfa_conversion_normal>
      <Function test_d2tf_tf2d_roundtrip[8-1]>
      <Function test_d2tf_tf2d_roundtrip[9-1]>
      <Function test_d2tf_tf2d_roundtrip[10-100]>
      <Function test_d2tf_tf2d_roundtrip[11-1000]>
      <Function test_d2tf_tf2d_roundtrip[12-10000]>
      <Function test_mjd_jd_round_trip>
      <Function test_mjd_jd_pulsar_round_trip>
      <Function test_mjd_jd_pulsar_round_trip_leap_sec_day_edge>
      <Function test_str_to_mjds>
      <Function test_mjds_to_str>
      <Function test_mjds_to_str_roundtrip>
      <Function test_day_frac>
      <Function test_two_sum>
    <Module test_tim_writing.py>
      <Function test_flags>
    <Module test_toa_indexing.py>
      <Function test_select>
      <Function test_getitem_boolean>
      <Function test_getitem_where>
      <Function test_getitem_slice>
    <Module test_toa_reader.py>
      <Function test_numpy_clusterss>
      <Function test_contiguous_on_load>
    <Module test_toa_shuffle.py>
      <Function test_shuffle_toas_residuals_match>
      <Function test_shuffle_toas_chi2_match>
      <Function test_shuffle_toas_clock_corr>
    <Module test_utils.py>
      <Function test_posvel_slice_indexing>
      <Function test_posvel_broadcasts>
      <Function test_posvel_broadcast_retains_quantity>
      <Function test_mjds_to_str_array>
      <Function test_mjds_to_str_array_roundtrip_doesnt_crash>
      <Function test_mjds_to_str_array_roundtrip_close>
      <Function test_str_to_mjds_array>
      <Function test_mjds_to_jds_array>
      <Function test_mjds_to_jds_pulsar_array>
      <Function test_jds_to_mjds_array>
      <Function test_jds_to_mjds_pulsar_array>
      <Function test_compute_hash_detects_changes>
      <Function test_compute_hash_accepts_no_change>
INTERNALERROR> Traceback (most recent call last):
INTERNALERROR>   File "/home/peridot/projects/pint/venv/lib/python3.11/site-packages/_pytest/main.py", line 272, in wrap_session
INTERNALERROR>     session.exitstatus = doit(config, session) or 0
INTERNALERROR>                          ^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR>   File "/home/peridot/projects/pint/venv/lib/python3.11/site-packages/_pytest/main.py", line 325, in _main
INTERNALERROR>     config.hook.pytest_collection(session=session)
INTERNALERROR>   File "/home/peridot/projects/pint/venv/lib/python3.11/site-packages/pluggy/_hooks.py", line 501, in __call__
INTERNALERROR>     return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
INTERNALERROR>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR>   File "/home/peridot/projects/pint/venv/lib/python3.11/site-packages/pluggy/_manager.py", line 119, in _hookexec
INTERNALERROR>     return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
INTERNALERROR>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR>   File "/home/peridot/projects/pint/venv/lib/python3.11/site-packages/pluggy/_callers.py", line 138, in _multicall
INTERNALERROR>     raise exception.with_traceback(exception.__traceback__)
INTERNALERROR>   File "/home/peridot/projects/pint/venv/lib/python3.11/site-packages/pluggy/_callers.py", line 121, in _multicall
INTERNALERROR>     teardown.throw(exception)  # type: ignore[union-attr]
INTERNALERROR>     ^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR>   File "/home/peridot/projects/pint/venv/lib/python3.11/site-packages/_pytest/logging.py", line 783, in pytest_collection
INTERNALERROR>     return (yield)
INTERNALERROR>             ^^^^^
INTERNALERROR>   File "/home/peridot/projects/pint/venv/lib/python3.11/site-packages/pluggy/_callers.py", line 121, in _multicall
INTERNALERROR>     teardown.throw(exception)  # type: ignore[union-attr]
INTERNALERROR>     ^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR>   File "/home/peridot/projects/pint/venv/lib/python3.11/site-packages/_pytest/warnings.py", line 118, in pytest_collection
INTERNALERROR>     return (yield)
INTERNALERROR>             ^^^^^
INTERNALERROR>   File "/home/peridot/projects/pint/venv/lib/python3.11/site-packages/pluggy/_callers.py", line 121, in _multicall
INTERNALERROR>     teardown.throw(exception)  # type: ignore[union-attr]
INTERNALERROR>     ^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR>   File "/home/peridot/projects/pint/venv/lib/python3.11/site-packages/_pytest/config/__init__.py", line 1365, in pytest_collection
INTERNALERROR>     return (yield)
INTERNALERROR>             ^^^^^
INTERNALERROR>   File "/home/peridot/projects/pint/venv/lib/python3.11/site-packages/pluggy/_callers.py", line 102, in _multicall
INTERNALERROR>     res = hook_impl.function(*args)
INTERNALERROR>           ^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR>   File "/home/peridot/projects/pint/venv/lib/python3.11/site-packages/_pytest/main.py", line 336, in pytest_collection
INTERNALERROR>     session.perform_collect()
INTERNALERROR>   File "/home/peridot/projects/pint/venv/lib/python3.11/site-packages/_pytest/main.py", line 809, in perform_collect
INTERNALERROR>     hook.pytest_collection_finish(session=self)
INTERNALERROR>   File "/home/peridot/projects/pint/venv/lib/python3.11/site-packages/pluggy/_hooks.py", line 501, in __call__
INTERNALERROR>     return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
INTERNALERROR>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR>   File "/home/peridot/projects/pint/venv/lib/python3.11/site-packages/pluggy/_manager.py", line 119, in _hookexec
INTERNALERROR>     return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
INTERNALERROR>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR>   File "/home/peridot/projects/pint/venv/lib/python3.11/site-packages/pluggy/_callers.py", line 138, in _multicall
INTERNALERROR>     raise exception.with_traceback(exception.__traceback__)
INTERNALERROR>   File "/home/peridot/projects/pint/venv/lib/python3.11/site-packages/pluggy/_callers.py", line 102, in _multicall
INTERNALERROR>     res = hook_impl.function(*args)
INTERNALERROR>           ^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR>   File "/home/peridot/projects/pint/venv/lib/python3.11/site-packages/hypofuzz/interface.py", line 37, in pytest_collection_finish
INTERNALERROR>     _, all_autouse, _ = manager.getfixtureclosure(
INTERNALERROR>                         ^^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR> TypeError: FixtureManager.getfixtureclosure() missing 1 required positional argument: 'ignore_args'

================================================================= 95/2885 tests collected (2790 deselected) in 35.81s =================================================================

Exiting because pytest returned exit code 3

Not working with hypothesis >= 6.72.2

The last version of hypothesis I can get to run with hypofuzz is 6.72.1. It appears to be related to refactoring of the core runner here: HypothesisWorks/hypothesis#3621

From 6.72.2 to 6.86.2 I get the following:

Process Process-2:
Traceback (most recent call last):
  File ".../lib/python3.11/site-packages/hypofuzz/hy.py", line 258, in _run_test_on
    args, kwargs = data.draw(self.__strategy)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../lib/python3.11/site-packages/hypothesis/internal/conjecture/data.py", line 937, in draw
    strategy.validate()
    ^^^^^^^^^^^^^^^^^
AttributeError: 'Stuff' object has no attribute 'validate'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File ".../lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File ".../lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File ".../lib/python3.11/site-packages/hypofuzz/interface.py", line 91, in _fuzz_several
    fuzz_several(*tests)
  File ".../lib/python3.11/site-packages/hypofuzz/hy.py", line 378, in fuzz_several
    t.run_one()
  File ".../lib/python3.11/site-packages/hypofuzz/hy.py", line 198, in run_one
    result = self._run_test_on(self.generate_prefix())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../lib/python3.11/site-packages/hypofuzz/hy.py", line 275, in _run_test_on
    traceback.format_exception(etype=type(e), value=e, tb=tb)
TypeError: format_exception() got an unexpected keyword argument 'etype'
Found a failing input for every test!

And from 6.87.0 up to latest it breaks due to:

INTERNALERROR>   File ".../lib/python3.11/site-packages/hypofuzz/hy.py", line 89, in from_hypothesis_test
INTERNALERROR>     _, _, _, search_strategy = process_arguments_to_given(
INTERNALERROR>     ^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR> ValueError: not enough values to unpack (expected 4, got 3)

Removing the extra _, on line 89 gets past that error, but results in the same error as above.

Unfortunately, I am not familiar enough with the workings of hypothesis / hypofuzz to figure out any solution. It doesn't seem simple to me, since the return types changed.

docs search on https://hypofuzz.com/docs/ does not work

Browser: Firefox 125.0.1 (64-bit) with clean profile

Example of a broken URL: https://hypofuzz.com/docs/search.html?q=stateful&check_keywords=yes&area=default

(I just typed in stateful into the search box and hit Enter key.)

Firefox console logs this:

Uncaught ReferenceError: jQuery is not defined
    <anonymous> https://hypofuzz.com/docs/search.html?q=stateful&check_keywords=yes&area=default:113
[search.html:113:7](https://hypofuzz.com/docs/search.html?q=stateful&check_keywords=yes&area=default)
    <anonymous> https://hypofuzz.com/docs/search.html?q=stateful&check_keywords=yes&area=default:113
Uncaught ReferenceError: jQuery is not defined
    <anonymous> https://hypofuzz.com/docs/search.html?q=stateful&check_keywords=yes&area=default:129
[search.html:129:5](https://hypofuzz.com/docs/search.html?q=stateful&check_keywords=yes&area=default)
    <anonymous> https://hypofuzz.com/docs/search.html?q=stateful&check_keywords=yes&area=default:129

I vaguely remember seeing this on other RTD sites as well as a temporary glitch - I think we solved it by updating deps and rebuilding? But I'm not sure and can't locate relevant records, sorry!

RuleBasedStateMachine tests do not work and are not documented

It seems that tests based on RuleBasedStateMachine are not supported. I've tried a full-text search for "RuleBased" and "stateful" in the hypofuzz docs but have not found any mention that this is unsupported.

After some digging I've found a mention in the sources:

# Skip state-machine classes, since they're not

but it's a bit mysterious :-)

Maybe the collector could print a message when it encounters an unsupported case, instead of silently ignoring it?

Versions tested

  • hypofuzz 24.2.3
  • hypothesis 6.100.1
  • pytest 8.0.2

Steps to reproduce

  1. Copy & paste example code from https://hypothesis.readthedocs.io/en/hypothesis-python-4.57.1/stateful.html into a file, say test_example.py.

  2. Run hypothesis fuzz test_example.py

Output

Usage: hypothesis fuzz [OPTIONS] [-- PYTEST_ARGS]
Try 'hypothesis fuzz -h' for help.

Error: No property-based tests were collected

Further details

With a slight modification to hypofuzz/interface.py we can get more verbose output. Diff:

@@ -81,8 +81,8 @@ def _get_hypothesis_tests_with_pytest(args: Iterable[str]) -> List["FuzzProcess"
             ],
             plugins=[collector],
         )
+    print(out.getvalue())  # noqa
     if ret:
-        print(out.getvalue())  # noqa
         print(f"Exiting because pytest returned exit code {ret}")  # noqa
         sys.exit(ret)
     return collector.fuzz_targets

Output:

================================ test session starts =================================
platform linux -- Python 3.11.8, pytest-8.0.2, pluggy-1.5.0
rootdir: /tmp/ste/orig
plugins: hypothesis-6.100.1, dash-2.16.1
collected 1 item

<Dir orig>
  <Module test_example.py>
    <UnitTestCase TestDBComparison>
      <TestCaseFunction runTest>
crashed in test_example.py::TestDBComparison::runTest 'function' object has no attribute 'hypothesis'

============================= 1 test collected in 0.02s ==============================

`raise NotImplementedError("unreachable")` when a falsifying case is found

👋🏽

I was trying to use hypothesis + hypofuzz to fuzz a hand-written parser in https://github.com/pypa/packaging/, after we got a report of a parser regression in pypa/packaging#618. In my first attempt to do so, I seem to have successfully hit a NotImplementedError, details below. :)

Steps to reproduce

  • Create a Python 3.11 venv and activate.
  • pip install hypofuzz
  • pip install git+https://github.com/pypa/packaging.git@606c71acce93d04e778cdbdea16231b36d6b870f (get the current main)
  • Have a test like:
from hypothesis import given, strategies as st

from packaging._tokenizer import Tokenizer


@given(st.from_regex(r"[a-zA-Z\_\.\-]+", fullmatch=True))
def test_names(name: str) -> None:
    # GIVEN
    source = name

    # WHEN
    tokens = Tokenizer(source)

    # THEN
    assert tokens.match("IDENTIFIER")
  • Run hypothesis fuzz -- {the-test-file-above}.
  • Notice a traceback after a few seconds.

Output

❯ hypothesis fuzz -- tests/test_requirements_tokeniser.py
using up to 1 processes to fuzz:
    tests/test_requirements_tokeniser.py::test_names


        Now serving dashboard at  http://localhost:9999/

 * Serving Flask app 'hypofuzz.dashboard'
 * Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on http://localhost:9999
Press CTRL+C to quit
127.0.0.1 - - [03/Dec/2022 12:16:11] "POST /_dash-update-component HTTP/1.1" 200 -
127.0.0.1 - - [03/Dec/2022 12:16:11] "POST /_dash-update-component HTTP/1.1" 200 -
127.0.0.1 - - [03/Dec/2022 12:16:11] "POST /_dash-update-component HTTP/1.1" 200 -
Process Process-2:
Traceback (most recent call last):
  File "/Users/pradyunsg/Developer/github/packaging/.venv/lib/python3.11/site-packages/hypofuzz/hy.py", line 264, in _run_test_on
    self.__test_fn(*args, **kwargs)
  File "/Users/pradyunsg/Developer/github/packaging/tests/test_requirements_tokeniser.py", line 15, in test_names
    assert tokens.match("IDENTIFIER")
AssertionError: assert False
 +  where False = <bound method Tokenizer.match of <packaging._tokenizer.Tokenizer object at 0x106154490>>('IDENTIFIER')
 +    where <bound method Tokenizer.match of <packaging._tokenizer.Tokenizer object at 0x106154490>> = <packaging._tokenizer.Tokenizer object at 0x106154490>.match

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/pradyunsg/.asdf/installs/python/3.11.0/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/Users/pradyunsg/.asdf/installs/python/3.11.0/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/pradyunsg/Developer/github/packaging/.venv/lib/python3.11/site-packages/hypofuzz/interface.py", line 91, in _fuzz_several
    fuzz_several(*tests)
  File "/Users/pradyunsg/Developer/github/packaging/.venv/lib/python3.11/site-packages/hypofuzz/hy.py", line 381, in fuzz_several
    targets[0].run_one()
  File "/Users/pradyunsg/Developer/github/packaging/.venv/lib/python3.11/site-packages/hypofuzz/hy.py", line 198, in run_one
    result = self._run_test_on(self.generate_prefix())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pradyunsg/Developer/github/packaging/.venv/lib/python3.11/site-packages/hypofuzz/hy.py", line 275, in _run_test_on
    traceback.format_exception(etype=type(e), value=e, tb=tb)
TypeError: format_exception() got an unexpected keyword argument 'etype'
Traceback (most recent call last):
  File "/Users/pradyunsg/Developer/github/packaging/.venv/bin/hypothesis", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/pradyunsg/Developer/github/packaging/.venv/lib/python3.11/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pradyunsg/Developer/github/packaging/.venv/lib/python3.11/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/pradyunsg/Developer/github/packaging/.venv/lib/python3.11/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pradyunsg/Developer/github/packaging/.venv/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pradyunsg/Developer/github/packaging/.venv/lib/python3.11/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pradyunsg/Developer/github/packaging/.venv/lib/python3.11/site-packages/hypofuzz/entrypoint.py", line 93, in fuzz
    raise NotImplementedError("unreachable")
NotImplementedError: unreachable
127.0.0.1 - - [03/Dec/2022 12:16:16] "POST /_dash-update-component HTTP/1.1" 200 -
127.0.0.1 - - [03/Dec/2022 12:16:16] "POST /_dash-update-component HTTP/1.1" 200 -
127.0.0.1 - - [03/Dec/2022 12:16:16] "POST /_dash-update-component HTTP/1.1" 200 -
127.0.0.1 - - [03/Dec/2022 12:16:21] "POST /_dash-update-component HTTP/1.1" 200 -
127.0.0.1 - - [03/Dec/2022 12:16:21] "POST /_dash-update-component HTTP/1.1" 200 -
127.0.0.1 - - [03/Dec/2022 12:16:21] "POST /_dash-update-component HTTP/1.1" 200 -
127.0.0.1 - - [03/Dec/2022 12:16:26] "POST /_dash-update-component HTTP/1.1" 200 -
127.0.0.1 - - [03/Dec/2022 12:16:26] "POST /_dash-update-component HTTP/1.1" 200 -
127.0.0.1 - - [03/Dec/2022 12:16:26] "POST /_dash-update-component HTTP/1.1" 200 -
127.0.0.1 - - [03/Dec/2022 12:16:31] "POST /_dash-update-component HTTP/1.1" 200 -
127.0.0.1 - - [03/Dec/2022 12:16:31] "POST /_dash-update-component HTTP/1.1" 200 -
127.0.0.1 - - [03/Dec/2022 12:16:31] "POST /_dash-update-component HTTP/1.1" 200 -
127.0.0.1 - - [03/Dec/2022 12:16:36] "POST /_dash-update-component HTTP/1.1" 200 -
127.0.0.1 - - [03/Dec/2022 12:16:36] "POST /_dash-update-component HTTP/1.1" 200 -
127.0.0.1 - - [03/Dec/2022 12:16:36] "POST /_dash-update-component HTTP/1.1" 200 -
^CException ignored in atexit callback: <function _exit_function at 0x11fa62f20>
Traceback (most recent call last):
  File "/Users/pradyunsg/.asdf/installs/python/3.11.0/lib/python3.11/multiprocessing/util.py", line 357, in _exit_function
    p.join()
  File "/Users/pradyunsg/.asdf/installs/python/3.11.0/lib/python3.11/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pradyunsg/.asdf/installs/python/3.11.0/lib/python3.11/multiprocessing/popen_fork.py", line 43, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pradyunsg/.asdf/installs/python/3.11.0/lib/python3.11/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt: 

Shrinker missing positional argument

I tried running hypofuzz, and this is the error I'm getting:

Traceback (most recent call last):
  File "C:\Users\tamir\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\process.py", line 314, in _bootstrap
    self.run()
  File "C:\Users\tamir\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Code\crafting-interpreters-py\.venv\Lib\site-packages\hypofuzz\interface.py", line 91, in _fuzz_several
    fuzz_several(*tests)
  File "C:\Code\crafting-interpreters-py\.venv\Lib\site-packages\hypofuzz\hy.py", line 378, in fuzz_several
    t.run_one()
  File "C:\Code\crafting-interpreters-py\.venv\Lib\site-packages\hypofuzz\hy.py", line 203, in run_one
    shrinker = Shrinker(
               ^^^^^^^^^
TypeError: Shrinker.__init__() missing 1 required positional argument: 'explain'

I'm running:

  • Windows 11
  • Python 3.11
  • hypofuzz 23.4.1

Another issue with shrinking

There I go again...
But this time, it did find a real crashing input for my program!

As for the issue -


Running with 23.5.2.

When running the following test:

@given(text())
def test_trivial(s):
    raise Exception()

I get the following error:

Traceback (most recent call last):
  File "C:\Users\tamir\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\process.py", line 314, in _bootstrap
    self.run()
  File "C:\Users\tamir\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Code\crafting-interpreters-py\.venv\Lib\site-packages\hypofuzz\interface.py", line 91, in _fuzz_several
    fuzz_several(*tests)
  File "C:\Code\crafting-interpreters-py\.venv\Lib\site-packages\hypofuzz\hy.py", line 390, in fuzz_several
    t.run_one()
  File "C:\Code\crafting-interpreters-py\.venv\Lib\site-packages\hypofuzz\hy.py", line 214, in run_one
    shrinker.shrink()
  File "C:\Code\crafting-interpreters-py\.venv\Lib\site-packages\hypothesis\internal\conjecture\shrinker.py", line 442, in shrink
    self.explain()
  File "C:\Code\crafting-interpreters-py\.venv\Lib\site-packages\hypothesis\internal\conjecture\shrinker.py", line 511, in explain
    seen_passing_buffers = self.engine.passing_buffers(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'EngineStub' object has no attribute 'passing_buffers'

Pythia-style predictive fuzzing

We've got basic predictions of branches-to-next-new-branch and branches-to-first-bug in the dashboard, but we can do more with this:

  • use predictive fuzzing like Pythia for prioritization of tests to fuzz; currently using number of inputs since last discovery
  • report expected runtime until next branch/bug in the dashboard
  • clearly explain why residual risk estimation is awesome on the website
  • allow user configuration of a cost per hour (per core?), and display cost estimates too
  • report a global estimate aggregated across running tests (requires dashboard to know/estimate number of active workers)
  • have a configurable threshold 'price per failure' at which to stop fuzzing

It would be really, really useful to collect lots of empirical measurements here in order to tune estimates and get around adaptive biases at least a little (see paper). On the other hand nobody likes telemetry.

Recognize, and stop fuzzing, when a test is exhausted

Hypothesis can detect when a strategy is exhausted, i.e. when all possible values have been tested, and will stop early in that case. For unit-testing-like workloads this typically only happens for simple strategies such as st.booleans() or small ranges of st.integers(), but in a long fuzzing run it could conceivably happen for considerably larger sets of values.

While not a high priority, it would be nice to implement this for HypoFuzz if it's possible to do so without consuming too much memory.

See: https://mboehme.github.io/paper/ICSE23.Effectiveness.pdf
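For reference, the existing Hypothesis behaviour on a tiny strategy looks like this (an illustration of the current feature, not HypoFuzz code):

from hypothesis import given, strategies as st

@given(st.booleans())
def test_exhaustible(b):
    # Once both True and False have been tried, Hypothesis notices the
    # search space is exhausted and stops early instead of spending the
    # full max_examples budget; HypoFuzz would want the same check.
    assert isinstance(b, bool)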

Better mutation logic

Better mutation operators

  • structure-aware operators for crossover / replacement / splicing etc
  • track provenance information for MOpt-style adjustment of frequencies
  • validity-aware mutations (Zest/RLCheck), based on structure
  • Nezha-like differential coverage via dynamic contexts?

Improved prioritization

Historically, fuzzers haven't really been guided towards inputs likely to improve branch coverage; rather, they have been good at exploiting newly covered branches once those are hit at random. We can do that too, but we can probably also do better.

  • use CFG from coverage to tell if new branches are actually available from a given path. If not, we can hit it less often.
    Note that branch coverage != available bugs; the control flow graph is not identical to the behaviour partition of the program.
  • try using a custom trace function, investigate performance and use of alternative coverage metrics (e.g. length-n path segments, callstack-aware coverage, etc.)
  • fuzz arbitrary scores with hypothesis.target() (see FuzzFactory; a short sketch follows this list)
  • exploit VCS metadata, i.e. target recently-changed parts of the SUT and new / recently changed tests (c.f. pypi-testmon)
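A concrete illustration of the hypothesis.target() bullet above, using zlib as a stand-in system under test:

import zlib

from hypothesis import given, target, strategies as st

@given(st.binary())
def test_compression_never_blows_up(payload):
    compressed = zlib.compress(payload)
    # Guide generation towards inputs that compress badly - the fuzzer can
    # maximise this score, much as FuzzFactory maximises domain-specific metrics.
    target(len(compressed) / (len(payload) + 1), label="compression ratio")
    assert len(compressed) <= len(payload) + 1024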

Ideas for fuzzing stateful tests

Stateful Greybox Fuzzing (https://mboehme.github.io/paper/USENIX22.pdf) has several nice tricks - and starting with explicitly-stateful tests lets us avoid the state-machine inference step. Some specific ideas:

  • each rule method could trigger an event() or perhaps target(-n_steps) (to prioritize reaching each quickly); see the sketch after this list
  • treat event-virtual-branches as a separate category rather than mixing them with code branches - we want even coverage across events and code branches as separate dimensions, because otherwise we'd never pay attention to any event that isn't extremely rare.
  • [fancier options from the paper] are plausibly nice but let's get baselines first.
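A minimal sketch of the per-rule event idea, written with plain Hypothesis stateful testing (the event labels and the toy state machine are arbitrary):

from hypothesis import event, strategies as st
from hypothesis.stateful import RuleBasedStateMachine, rule

class QueueMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.items = []

    @rule(x=st.integers())
    def push(self, x):
        event("rule:push")  # a 'virtual branch' for having reached this rule
        self.items.append(x)

    @rule()
    def pop(self):
        event("rule:pop")
        if self.items:
            self.items.pop()

TestQueue = QueueMachine.TestCase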

Explore other dashboard tech options

Maybe I just need to refactor things and then Plotly Dash will be satisfyingly declarative and performant.

Alternatively I might want to explore other frameworks - e.g. Bokeh, or go up a level to Holoviews. Also worth checking whether using Pyodide in the browser is a better user experience? I think I'm still better off sticking to Python over JS for this, since I'm much better at the former.

Add covering inputs as `@example(...)`s in the code

Here's a neat workflow, combining the benefits of PBT and fuzzing with deterministic tests:

  1. Use the fuzzer to find a reasonably diverse set of covering examples (already works)
  2. Automatically edit them into the code as explicit @example(...) cases (this issue!)
  3. Run your standard CI with only full-explicit deterministic examples (already works; see also python/cpython#22863)

So what will it take to have automatically-maintained explicit examples? Some quick notes:

  • This only works for test cases which can be written using the @example() decorator, which rules out stateful tests or those using st.data(). We'll also have trouble with reprs that can't be eval'd back to an equivalent object - we might get a short distance by representing objects from st.builds() as the result of the call (also useful for HypothesisWorks/hypothesis#3411), but this seems like a fundamental limitation.
  • We need to know where the test is, and how to insert the decorator. Introspection works, albeit with some pretty painful edge cases we'll need to bail out on, and I think LibCST should make the latter pretty easy - we can construct a string call, attempt to parse it, and then insert it into the decorator list.
  • My preferred UX for this is "HypoFuzz dumps a <hash>.patch file and the user does git apply ...". We can dump the file on disk, and also make it downloadable from the dashboard for remote use. The patch shouldn't be too ugly, e.g. one line per arg, but users are expected to run their choice of autoformatter.
  • I mentioned "automatically-maintained": it'd be nice to remove previously-covering examples when the set updates; or crucial if we haven't shrunk to a minimal covering example (and currently we don't!). This probably means using magic comments to distinguish human-added examples from machine-maintained covering examples. Note that fuzzer-discovered minimal failing examples might be automatically added to the former set!

This seems fiddly, but not actually that hard - we already report covering examples on the dashboard, after all. No timeline on when I'll get to this, but I'd be very happy to provide advice and code review to anyone interested in contributing 🙂
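For illustration only, the end state in user code might look something like this - the marker comment is hypothetical, and encode/decode are trivial stand-ins for the real system under test:

from hypothesis import example, given, strategies as st

def encode(s: str) -> bytes:
    return s.encode("utf-8")

def decode(b: bytes) -> str:
    return b.decode("utf-8")

@example("")        # human-added regression case
@example("0" * 64)  # covering example, maintained by HypoFuzz (hypothetical marker)
@given(st.text())
def test_roundtrip(s):
    assert decode(encode(s)) == s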

Hypofuzz lacks setup hooks of _any_ sort

Right now, Hypofuzz provides no hooks for when it starts running a test target. This is a problem because the only point before an individual fuzz run at which one can do setup work is import time, which makes it nearly impossible to avoid doing that work when running the test suite conventionally (i.e. with pytest and Hypothesis, not Hypofuzz).

At the very least, a hook called from FuzzProcess.startup would let me do Hypofuzz-only setup steps instead of having to try to detect whether we're running under pytest or not during import, which may not work anyway since Hypofuzz delegates test collection (and importing) to pytest.

Incorrect data aggregation for dashboard plots?

I got a bug report by email, saying that there's a bug in the dashboard where the lengths of columns are different to the length of the index, for various columns in some pandas DataFrame inside the px.line() call. Perhaps sometimes only some keys are appended?

Putting this on a more principled (and tested!) basis might be nice even before the big changes in #3.

Hypofuzz not collecting any property-based tests

Under PyPy (Python 3.9), hypothesis 6.87, and an uncertain version of hypofuzz, I can't get hypofuzz to collect any property-based tests, even when I explicitly specify them.

For instance:

(venv2) [alex@localhost traveller_pyroute]$ hypothesis fuzz -- -k test_parse_line_to_star_and_back
Usage: hypothesis fuzz [OPTIONS] [-- PYTEST_ARGS]
Try 'hypothesis fuzz -h' for help.

Error: No property-based tests were collected
(venv2) [alex@localhost traveller_pyroute]$

Here's the test_parse_line_to_star_and_back test, which lives in Tests/Hypothesis/testStar.py, inside the testStar class (sans the 31 examples accumulated running under classic hypothesis)

    """
    Given a regex-matching string that results in a Star object when parsed, that Star should parse cleanly to an input
    line
    """
    @given(from_regex(regex=Star.starline,
                      alphabet='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWYXZ -{}()[]?\'+*'))
    @settings(
        suppress_health_check=[HealthCheck(3), HealthCheck(2)],  # suppress slow-data health check, too-much filtering
        deadline=timedelta(1000))
    def test_parse_line_to_star_and_back(self, s):
        sector = Sector(' Core', ' 0, 0')
        pop_code = 'scaled'
        ru_calc = 'scaled'
        foo = Star.parse_line_into_star(s, sector, pop_code, ru_calc)
        assume(foo is not None)
        foo.trim_self_ownership()
        foo.trim_self_colonisation()
        self.assertIsNotNone(foo._hash, "Hash not calculated for original star")

        foo.index = 0
        foo.allegiance_base = foo.alg_base_code
        self.assertTrue(foo.is_well_formed())

        parsed_line = foo.parse_to_line()
        self.assertIsNotNone(parsed_line)
        self.assertLessEqual(80, len(parsed_line), "Round-trip line unexpectedly short")

        nu_foo = Star.parse_line_into_star(parsed_line, sector, pop_code, ru_calc)
        self.assertTrue(isinstance(nu_foo, Star), "Round-trip line did not re-parse")
        nu_foo.index = 0
        nu_foo.allegiance_base = nu_foo.alg_base_code
        self.assertTrue(nu_foo.is_well_formed(log=False))
        self.assertIsNotNone(nu_foo._hash, "Hash not calculated for re-parsed star")

        self.assertEqual(foo, nu_foo, "Re-parsed star not __eq__ to original star.  Hypothesis input: " + s + '\n')
        self.assertEqual(
            str(foo.tradeCode),
            str(nu_foo.tradeCode),
            "Re-parsed trade codes not equal to original trade codes.  Hypothesis input: " + s + '\n'
        )
        self.assertEqual(
            len(foo.star_list),
            len(nu_foo.star_list),
            "Re-parsed star list different length to original star list.  Hypothesis input: " + s + '\n'
        )

        nu_parsed_line = nu_foo.parse_to_line()
        self.assertEqual(
            parsed_line,
            nu_parsed_line,
            "New reparsed starline does not equal original parse-to-line output.  Hypothesis input: " + s + '\n'
        )

Running that test through pytest results in pytest seeing and collecting it:

(venv2) [alex@localhost traveller_pyroute]$ pytest -k test_parse_line_to_star_and_back
========================================================================================================= test session starts =========================================================================================================
platform linux -- Python 3.9.17[pypy-7.3.12-final], pytest-7.4.2, pluggy-1.2.0
Using --randomly-seed=2622186668
rootdir: /home/alex/gitstuf/traveller_pyroute
configfile: pytest.ini
testpaths: Tests, Tests/Pathfinding, Tests/Position
plugins: randomly-3.15.0, hypothesis-6.87.3, console-scripts-1.4.1, subtests-0.11.0
collected 225 items / 224 deselected / 1 selected                                                                                                                                                                                     

Tests/Hypothesis/testStar.py 

Before I get too wound up, I'd appreciate help figuring/ruling out what I've done wrong.

Incorrect elapsed time in dashboard

When running tests, it seems that the "elapsed time" displayed is around a fifth of the actual time (at least on my machine).

I ran a test for 10 minutes, and it listed 2 minutes as the elapsed time.

Failures not reported on python >= 3.10

hy.py fails to complete its task of collecting an actual failure because it fails itself within _run_test_on() for python 3.10 or newer like so:

traceback.format_exception(etype=type(e), value=e, tb=tb)
TypeError: format_exception() got an unexpected keyword argument 'etype'

I think this is because the signature of traceback.format_exception() changed from 3.9 to 3.10 to make the first argument positional only.
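For reference, both of these spellings work on Python 3.10+, and the three-positional-argument form also works on earlier versions:

import traceback

try:
    1 / 0
except ZeroDivisionError as e:
    old_style = traceback.format_exception(type(e), e, e.__traceback__)
    new_style = traceback.format_exception(e)  # 3.10+ shorthand
    assert old_style == new_style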

Workflow and development lifecycle improvements

  • expand user-facing documentation, examples, tips, etc.
  • better reporting of test collection, e.g. tests skipped due to use of fixtures
  • warn about tests where found examples can't be replayed because of settings decorator with derandomize=True or database=None; recommend profiles instead
  • example configuration for use in GitHub Actions, including shared database

Dashboard Issues

When looking at a specific test, the "Minimal covering examples" just shows the same example repeated dozens of times. I know it is going through a bunch of different ones from the cache, and not finding any more coverage. Just not displaying all of them.

And "See patches with covering and/or failing examples" throws a ValueError complaining about the data frame not having any columns.

I am running it on Python 3.11, and these are the versions of packages I think could be relevant. They are the latest version of everything except pytest, which has a compatibility issue.

pytest == 7.4.4
hypofuzz == 23.12.1
hypothesis == 6.93.2
dash == 2.15.0
flask == 3.0.1
jinja2 == 3.1.3
numpy == 1.26.3
pandas == 2.0.3
plotly == 5.18.0

Construct and use a 'fuzzing dictionary'

In fuzzing, a "dictionary" is a corpus of known-interesting fragments (boundary values, html tags, etc.) that can be mixed in with randomly-generated or mutated data to increase the chance of stumbling across interesting bugs.

We kinda support doing this with Hypothesis for some types already; it's how we boost the chances of boundary integers and "interesting" floats. However there's not currently any mechanism for adding to the pool at runtime, and adding one will take some care to ensure that we can still replay failing examples without that runtime pool. See also HypothesisWorks/hypothesis#3086 and HypothesisWorks/hypothesis#3127 (comment).

Once we've got that, the standard easy way to get a dictionary is to run strings on your binary. The natural equivalent is to grab our Python source code and collect all the ast.Constant values! (excluding perhaps long strings, which are likely docstrings)
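A rough sketch of that harvesting step (the length cutoff for "probably a docstring" is a guess):

import ast
from pathlib import Path

def collect_constants(source_dir: str) -> set:
    """Walk every .py file under source_dir and pool its ast.Constant values."""
    pool = set()
    for path in Path(source_dir).rglob("*.py"):
        tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, ast.Constant) and isinstance(node.value, (str, bytes, int, float)):
                if isinstance(node.value, str) and len(node.value) > 80:
                    continue  # likely a docstring or other long text
                pool.add(node.value)
    return pool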

A more advanced trick, shading into full research project, would be to investigate Redqueen-style tracking. For example, "a string in the input matched against this regex pattern in the code, so try generating strings matching that pattern".

Pytest parametrize support

I know the docs say fixtures are not supported, and that's fine, I don't use them with one exception:

@pytest.mark.parametrize("cls", getSubclasses())
@given(...)
def test_subclass(cls, ...): ...

Currently I work around this by just copying the function and hardcoding it, but there are 20+ subclasses.

I get that fixtures introduce state that could muddy things. But I don't believe parametrize does?
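One workaround sketch that keeps everything in a single Hypothesis test which HypoFuzz can collect - get_subclasses and the body are placeholders for the real test:

from hypothesis import given, strategies as st

def get_subclasses():
    # placeholder for the real subclass discovery
    return [int, float, complex]

@given(cls=st.sampled_from(get_subclasses()), value=st.integers())
def test_subclass(cls, value):
    # every subclass is exercised by one fuzz target, instead of 20+ copies
    assert isinstance(cls(value), cls)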

Database-centric architecture for communication, persistence, and autoscaling

The Status Quo

This is going to be substantial architecture overhaul, so let's start with how things currently work: a HypoFuzz run has three basic parts:

  1. The Hypothesis database, a key-value store of failing and covering examples we've seen in previous runs (or other workers in this run)
  2. The worker processes, which spend their time executing test cases for a variety of test functions, plus some 'meta' scheduling
  3. The dashboard process, which serves a webpage showing how things are going, based on information sent by the workers over http.

In the current design, this is fundamentally a run-it-on-one-box kind of system: the tests are divided up between workers at startup time (or maybe run on every worker concurrently; the workers are fine with this though the dashboard isn't), and while the workers can reload the previous examples everything else is as if it were the first run ever - with some hit to efficiency and the clarity of statistics.

Goal: support a system where workers can come and go, for example to soak up idle CPU time as a low-priority autoscaling group on a cluster, and the fuzzing system overall keeps humming along.

Solution: lean on the database

If our problem is that information is neither persisted nor well distributed, let's solve that with the Hypothesis database! This is a very simple key-value store where keys are bytestrings and values are sets of bytestrings, with create/read/delete operations. The most common implementation is on the user's local filesystem, but there's also a Redis backend and it's trivial to write more.
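For context, the whole interface we'd be leaning on is roughly three methods, shown here with the in-memory backend:

from hypothesis.database import InMemoryExampleDatabase

db = InMemoryExampleDatabase()
db.save(b"test-key", b"a covering example")
db.save(b"test-key", b"a failing example")
assert set(db.fetch(b"test-key")) == {b"a covering example", b"a failing example"}
db.delete(b"test-key", b"a failing example")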

What problems does this solve, and create?

  • ✨ Workers could write metadata to the database (in some disjoint keyspace), meaning that the dashboard could show information regardless of whether a worker is currently running - it'd just be a view over the database (generally good design principle!), and update by polling at whatever frequency we wanted.
    • ✨ We no longer need any HTTP traffic between components of the system; subtracting parts is underrated.
    • ✨ If we have two separate systems ("partition tolerance"), we can just merge the databases (via e.g. MultiplexedDatabase) and keep going.
    • 🚧 We have to handle stale data, including from runs that diverged or never even had a common prefix. This is basically fine; "keep the examples from everything and discard all the metadata" is totally valid and anything fancier is a bonus. We'll probably try to construct a 'best guess' metadata though, e.g. keeping the longest.
      • For recently-diverged workers, which is a common case when two are fuzzing the same target, we can just sum the effort spent fuzzing in the same most-recent state. More complicated schemes run up against the question "to what degree should we reset state estimation when we discover new behavior", which is to my knowledge an open research problem (see Estimating Residual Risk in Greybox Fuzzing).
    • 🚧 Worse, we have to handle data from different code: database keys are derived from a hash of the test function, so they necessarily stay the same across changes in the code under test. Supposing we restart the fuzzer with a new bug-containing commit: the coverage information we have saved is likely to be wrong, and we might even have deprioritized testing that area!
      • Again, "keep the examples and ditch the metadata" would be OK here, though we'd need to track the commit that we're fuzzing. I'll continue assuming the presence of git; other VCS systems can be supported as demand arises.
      • Upside, if we're using VCS metadata we could prioritize fuzzing recently-changed code...
      • What about library versions though? Or operating systems? Or Python versions? To what degree should we distinguish these at the worker level, and/or in the dashboard?
    • 🤔 Tracking provenance information about how we found each covering or failing example (e.g. blackbox/greybox/whitebox; for the fuzzer, which mutations from which seed, the test-case number at discovery, etc.) can be really helpful in visualizing and understanding how the process is going. Lots of interesting experiments and some literature exploiting this.
  • 🚧 The dashboard process does need a local worker, in order to replay failing examples etc. - in not-that-rare pathological cases, this can produce more data than we'd want to persist for every test. Replaying live in the dashboard-worker also ensures that every test failure is reproducible.
    • What if a test only fails on Windows, but the dashboard is on Linux? We do not want to delete that "fixed" failing example! Idea: give each test function an 'environment suffix', plus the ability to read from all other suffixes of the same test. That way we can fail to replay without risking deleting the case before it's reproduced in the environment it fails in.

Action Items

MVP is to ditch http and communicate all state through the database.

  • Metadata is just what we need to get the dashboard working, see that code for details. It's saved per-test by each worker.
  • Display whichever history is the longest, we really are going for MVP here. Handle the simple case: each test has a single worker.
  • Support for starting a dashboard without associated workers, beyond the minimum to display failing examples etc.
  • ?? does this actually work at all without the fancier stuff ??

Better dashboard means we can get a little fancier about what we're displaying (mostly to keep these ideas out of the MVP):

  • Metadata includes:
    • metadata-version-number
    • git commit hash, maybe other environment metadata (package versions? OS? etc.)
    • I have a marvelous design for an append-only log from which we can usually recover a linearizable tree. Entries include (worker UUID, hypothesis phase, start state, number of test cases, optional new state [, provenance etc. tbd]); states are hashes of interesting-origin or reason-to-keep-seed.
  • Pretty sure that if we emit to the log every time we switch test, find something new, or notice someone else found something new, this is sufficient to recover a tree; and linearizing it is usually lossless.
  • We can probably synchronize a lot of worker state from this log, in addition to using it for the dashboard

The full version is going to be an ongoing project. Once we get here, I'll aim to close this and split out more specific issues.

"found failing" for test that cannot fail.

I am getting the impression that I am using this wrong...

I have the following test:

from hypothesis import given
from hypothesis.strategies import text

@given(text())
def test_fuzz(s):
    return

I run it using:

hypothesis.exe fuzz -- .\tests\test_fuzz.py

It immediately reports that it found a failing input.
When I check the dashboard, I get an exception thrown from inside the Hypothesis code:

[screenshot of the exception raised inside Hypothesis not reproduced here]

Am I doing something wrong, or is this a library issue?

Possible memory leak in dashboard?

I've had a report that there is a memory leak of sorts (or perhaps it’s just the dashboard point append?) that makes the fuzz process take 100G of RAM in a few hours. We should probably trace this with e.g. https://github.com/bloomberg/memray, and either fix the leak or compress the trace by dropping intermediate identical points - we only need first and last for the plot I think.
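A hedged sketch of the compression idea, assuming each dashboard point is a (timestamp, value) pair and that the plot only needs the first and last point of a flat segment:

def compress_points(points):
    out = []
    for t, v in points:
        if len(out) >= 2 and out[-1][1] == out[-2][1] == v:
            out[-1] = (t, v)  # slide the end of the flat segment forward
        else:
            out.append((t, v))
    return out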

Surprisingly, neither hypothesis nor hypofuzz finds the failing example, even though a small one exists

The following test has the failing example "^RUN", but neither running hypothesis over night nor running hypofuzz for a couple of hours finds the failing string:

from hypothesis import given
import hypothesis.strategies as st


magic_words = ["RUN", "JUMP"]


def parse_spells(output: str) -> dict[str, str]:
    spells = {}
    for line in output.splitlines():
        tokens = line.split("^")
        if len(tokens) == 2:
            name, spell = tokens
            if spell not in magic_words:
                continue
            spells[name] = spell
    return spells


@given(st.text())
def test_parse_spells(text):
    assert parse_spells(text) == {}

This is a simplified version of something we had in production.

Split the HypoFuzz engine into a Hypothesis `backend` and an executor

Hypothesis has recently grown a notion of alternative backends, which use the new IR layer (HypothesisWorks/hypothesis#3921) to support e.g. symbolic execution (HypothesisWorks/hypothesis#3914).

Supporting @settings(backend="hypofuzz") would be quite useful - for example, as an easy way to work with Pytest fixtures, or to support fuzzing in environments where Pytest is not available at all (CPython alphas?).
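In user code the end state would presumably look like this - note that "hypofuzz" is not yet a valid backend name, so this is purely illustrative:

from hypothesis import given, settings, strategies as st

@settings(backend="hypofuzz")  # hypothetical - no such backend exists yet
@given(st.integers(), st.integers())
def test_addition_commutes(a, b):
    assert a + b == b + a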

The implementation isn't trivial, since we have to rearrange a lot of code to both fit with the IR-oriented, context-manager-scoped interface for backends - while also having a hook to allow for interleaving execution of multiple tests - but I don't think there are any fundamental difficulties.

This change is independent of #3, but strongly complementary in practice.

Report coverage stability in the dashboard

We should also report the coverage stability fraction, including a rating of stable (100%), unstable (85%--100%), or serious problem (<85%); and explain the difference between stability (=coverage) and flakiness (=outcome). Stability is mostly an efficiency thing; flakiness means your test is broken.

This mostly requires measuring both of these on the backend and then plumbing the data around; it's not hugely involved.
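A quick sketch of the proposed rating, assuming "stability" is measured as the fraction of re-executions that reproduce the same set of branches:

def stability_rating(stable_replays: int, total_replays: int) -> str:
    fraction = stable_replays / total_replays
    if fraction == 1.0:
        return "stable"
    if fraction >= 0.85:
        return "unstable"
    return "serious problem"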
