
Default runs too low #190

Open
gampleman opened this issue Jul 29, 2022 · 2 comments

Comments

@gampleman
Contributor

A gripe I've had for a while is that the default of 100 runs is way too low to get decent coverage in most scenarios; in my experience it's insufficient for most test suites.

I usually recommend about 10,000 runs as a baseline, adjusted according to the desired run time.
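For reference, raising the run count per test is already possible with Test.fuzzWith; a minimal sketch, assuming elm-explorations/test 1.x where FuzzOptions is { runs : Int }:

```elm
import Expect
import Fuzz
import Test exposing (Test)


-- The same kind of property test, but with 10,000 runs instead of the default 100.
reverseRoundTrip : Test
reverseRoundTrip =
    Test.fuzzWith { runs = 10000 }
        (Fuzz.list Fuzz.int)
        "reversing a list twice yields the original list"
        (\xs -> xs |> List.reverse |> List.reverse |> Expect.equal xs)
```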

From a DX perspective, I think specifying this as a number of runs should really be considered an abstraction leak. Property tests assert that a condition holds for all inputs meeting some criteria; generating a certain number of samples is an implementation detail of verifying that assertion, and the user doesn't necessarily have a good mental model of how many samples that should be (indeed, working this out requires some fairly non-trivial statistical understanding, as well as knowledge of implementation details of the fuzzers, etc.).

Here are some practical suggestions for how to improve this (a hypothetical API sketch follows the list):

  1. Let the user specify the (wall-clock) time they want the test suite to run for. This is nice since, for instance, in watch mode we might want to prioritise fast iteration time, while in CI we often have other jobs running in parallel, so we have a pretty good idea of how much "spare" time our tests can take.
  2. Let the user specify a minimum coverage, i.e. a percentage of the available input space to validate (this would make more sense together with "Fuzzers should exhaustively check all values of a type if possible/reasonable" #188). Ergonomically this might be nicer to specify in some smaller unit, like 1/1,000,000 or some such. This is nice in the sense that it directly specifies our certainty of not having a bug :)
  3. Have labelling ("Discuss adding labels for fuzzed values" #94) and run enough tests to achieve the desired distribution on each label.
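To make these concrete, here is one hypothetical shape such a configuration could take; the RunBudget type and all of its constructors are invented for illustration and do not exist in elm-test:

```elm
-- Hypothetical sketch; nothing below exists in elm-test today.


type RunBudget
    = FixedRuns Int -- current behaviour: generate exactly N samples
    | WallClock Float -- suggestion 1: run for this many seconds
    | Coverage Float -- suggestion 2: cover this fraction of the input space


-- Watch mode: prioritise fast iteration time.
watchBudget : RunBudget
watchBudget =
    WallClock 0.5


-- CI: we have a good idea how much "spare" time the tests can take.
ciBudget : RunBudget
ciBudget =
    WallClock 60


-- Directly express certainty: validate one millionth of the input space.
coverageBudget : RunBudget
coverageBudget =
    Coverage (1 / 1000000)
```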

A separate issue that could be resolved much more quickly (though it would also be a breaking change) is that Test.fuzzWith expects an absolute number of runs. I think this is unergonomic, since it's a value one needs to keep fiddling with. A nicer design would be a multiplier of the globally configured value. This could cover both "this test is super slow, so let's not waste too much time testing it" and "this test has highly variable behaviour, so let's spend a lot of our time on the input space", while still letting the test runner influence the total number of tests to run. (See the sketch below.)
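A sketch of what that multiplier design could look like; fuzzWithMultiplier and runsMultiplier are made up for illustration and are not part of any existing API:

```elm
import Expect
import Fuzz
import Test exposing (Test)


-- Hypothetical API: the multiplier scales whatever run count the
-- runner is globally configured with, rather than fixing it per test.
slowTest : Test
slowTest =
    Test.fuzzWithMultiplier { runsMultiplier = 0.1 }
        (Fuzz.list Fuzz.int)
        "super slow test: don't waste too much time here"
        (\xs -> List.length (List.sort xs) |> Expect.equal (List.length xs))


trickyTest : Test
trickyTest =
    Test.fuzzWithMultiplier { runsMultiplier = 10 }
        Fuzz.string
        "highly variable behaviour: spend extra time exploring the input space"
        (\s -> String.reverse (String.reverse s) |> Expect.equal s)
```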

@Janiczek
Collaborator

Janiczek commented Jul 29, 2022

I like point 1). In fact, in addition to saying "run for N runs" and "run for N seconds", I'd like to be able to say "run indefinitely", although that could probably be emulated with a high enough number of runs or seconds.

Re 2): for integers the space is finite but huge, so this could work there, but what about lists, strings, and other collections? Would we arbitrarily decide that "the input space is all lists below 50 elements"?

Re 3): I'm not completely sure these are related.

The number of tests needed grows as the real distribution of the label nears the wanted distribution. (In Hughes' talk the numbers might be made up, but anyway: with distributions 4.231% and 5% it took 51,200 generated values to verify it would never reach 5%, and with distributions 4.123% and 5% it took 102,400 generated values.)

EDIT: actually, now that I reread this, it's backwards. We should look into the Haskell code implementing this to understand the relationship better.

This seems different from verifying (with some probability p) that the test will never fail. Again, I don't know how we'd find the needed number of tests.

The one metric that I believe could tell us whether we've tested the program enough is code-coverage-guided generation (like AFL does), perhaps with some symbolic execution sprinkled in. If you went through all the meaningfully different paths (if x < 5 then path1 else path2 only splits the values you need to check into n | n < 5 and n | n >= 5, and within each of those categories any value will do), then perhaps you could say, with some certainty, "OK, we can stop fuzzing, we will not find anything new".
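A minimal Elm illustration of that equivalence-class idea (the function under test is made up for the example):

```elm
-- One branch splits the input space into exactly two equivalence
-- classes: n < 5 and n >= 5.
classify : Int -> String
classify x =
    if x < 5 then
        "path1"

    else
        "path2"


-- Under path-coverage reasoning, two representative inputs (say 0 and 5)
-- exercise every meaningfully different path; additional random samples
-- within each class cannot reveal new branch behaviour in this function.
```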

@gampleman
Contributor Author

Yeah, that would only work in addition to specifying some memory limit your application has to fit in. If you specify that your application has to fit into, e.g., 100 MB of RAM, then all data structures are finite.
