bheisler / criterion.rs

Statistics-driven benchmarking library for Rust

License: Apache License 2.0

Rust 84.64% Shell 0.03% Python 0.10% HTML 15.23%
benchmark criterion gnuplot rust statistics

criterion.rs's People

Contributors

alexbool, andrewbanchich, bheisler, bsteinb, damienstanton, eh2406, faern, gnzlbg, gz, heroickatora, humb1t, japaric, killercup, lemmih, mkantor, nbarrios1337, palfrey, pierrechevalier83, pseitz, rreverser, rukai, samueltardieu, sjackman, timmmm, toomanybees, torkleyy, tottoto, vks, waywardmonkeys

criterion.rs's Issues

Spawn a second task to monitor CPU temperature and CPU load

And use this information to smartly select warm-up and cooldown times. (?)


Last time I tried to monitor CPU load (using GNU time), I saw constant (100%) CPU usage, but I may have been measuring incorrectly (maybe the sampling interval was too large compared to the lifetime of the process).
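
As a starting point, here's a minimal sketch of a background monitoring task, assuming a Linux-only /proc/loadavg source for load (temperature would need a platform-specific source such as /sys/class/thermal, which is not shown):

use std::fs;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Spawn a thread that samples the 1-minute load average every 500 ms until
// asked to stop, then returns the collected samples.
fn spawn_load_monitor(stop: Arc<AtomicBool>) -> thread::JoinHandle<Vec<f64>> {
    thread::spawn(move || {
        let mut samples = Vec::new();
        while !stop.load(Ordering::Relaxed) {
            if let Ok(contents) = fs::read_to_string("/proc/loadavg") {
                if let Some(first) = contents.split_whitespace().next() {
                    if let Ok(load) = first.parse::<f64>() {
                        samples.push(load);
                    }
                }
            }
            thread::sleep(Duration::from_millis(500));
        }
        samples
    })
}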

Estimated time is totally wrong when using the new timing loops

The estimated time is wrong, so the benchmark takes way longer than the specified measurement time.

  • iter_with_large_setup: the estimated time doesn't include the large setup
  • iter_with_setup: the estimated time doesn't include the per iteration setup
  • iter_with_large_drop: the estimated time doesn't include the final destructor call

Add benchmark-main & related macros

The goal here is to allow basic usage of cargo bench without requiring the extra --test --nocapture --test-threads 1 arguments by disabling the default benchmarking harness.

See the bencher crate, which already does something similar.
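
For reference, a sketch of what the macro-based setup could look like, modeled on the bencher crate (criterion eventually shipped criterion_group!/criterion_main! along roughly these lines); the fibonacci function is just a placeholder:

use criterion::{criterion_group, criterion_main, Criterion};

fn fibonacci(n: u64) -> u64 {
    (1..=n).fold((0u64, 1u64), |(a, b), _| (b, a + b)).0
}

fn fib_benchmark(c: &mut Criterion) {
    c.bench_function("fib 20", |b| b.iter(|| fibonacci(20)));
}

// These macros generate a main(), so the default libtest harness can be
// disabled with `harness = false` in the [[bench]] section of Cargo.toml.
criterion_group!(benches, fib_benchmark);
criterion_main!(benches);

Together with harness = false in Cargo.toml, this removes the need for the extra cargo bench arguments.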

Consider collecting the output of the benchmarked routine (`|| -> T`) to avoid calling `T`'s destructor during the timing loop

Currently, the timing loop looks like this:

// Self = Bencher
fn iter<T>(&mut self, routine: || -> T) {
    self.ns_start = time::precise_time_ns();
    for _ in range(0, self.niters) {
        black_box(routine());
    }
    self.ns_stop = time::precise_time_ns();
}

If routine is e.g. || Vec::from_elem(1024, 0u8) (which returns Vec<u8>), then on each iteration a vector is constructed and destroyed, hence Criterion is measuring both operations. The user has no way of measuring only the construction operation with this timing loop.

If the measuring loop is changed to:

fn iter<T>(&mut self, routine: || -> T) {
    let niters = self.niters;
    let mut outputs = Vec::with_capacity(niters);

    self.ns_start = time::precise_time_ns();
    for _ in range(0, niters) {
        outputs.push(routine());
    }
    self.ns_stop = time::precise_time_ns();

    drop(outputs);
}

Now Criterion will measure only the vector construction; the vector destruction won't happen during the measurement loop. Instead, all the vectors will be destroyed at the end of iter's scope.

The user can still measure both construction and destruction with this new timing loop by changing routine to || { Vec::from_elem(1024, 0u8); }, which returns () (although some black_boxing may be required to prevent the compiler from optimizing the routine away).

The time model changes slightly from:

elapsed = precise_time_ns + niters * (routine + drop(T))

to:

elapsed = precise_time_ns + niters * (routine + Vec.push)

Criterion will now report/analyze the execution time of (routine + Vec.push).

I think it makes sense to provide both timing models, to let the users pick whatever works best for them.
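
For comparison, here's a sketch of how those two timing models look with the API criterion later grew (iter_with_large_drop keeps the destructor out of the timed loop; dropping inside the closure keeps it in). This is an illustration, not the implementation discussed above:

use criterion::{black_box, Criterion};

fn timing_models(c: &mut Criterion) {
    // Measure only construction: the Vecs are collected and dropped after
    // the timed loop ends.
    c.bench_function("vec_new_only", |b| {
        b.iter_with_large_drop(|| vec![0u8; 1024])
    });

    // Measure construction + destruction: the Vec is dropped inside the
    // closure, i.e. inside the timed loop.
    c.bench_function("vec_new_and_drop", |b| {
        b.iter(|| {
            let v = vec![0u8; 1024];
            black_box(&v);
        })
    });
}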

Add a way to programmatically get benchmark results

It would be nice if a benchmarking run could return the percentiles and other statistics about the benchmark. This could be useful for multiple purposes; here are some examples:

  • Programs doing some self-diagnosis on startup to calibrate performance settings (think of games finding optimal settings for your hardware automatically).
  • Writing automatic CI performance-monitoring services that test performance over commits and report to some system.
  • Library users who would like to draw their own plots or save the results in a format other than the one built into criterion.rs.

This might be related to #15; I'm not sure what that issue wants to accomplish exactly.

Expand Readme

Add more information to the readme file, including basic usage examples, project goals, etc.

Librarify the analysis process

Currently criterion always performs the analysis of the benchmark in the same way. Criterion needs to be librarified to let the users configure the analysis process.

Publish to Crates.io?

Hi! Criterion is a wonderful library, and I've been using it locally for a lot of benchmarking. The problem is that I have a crate I want to publish to Crates.io ... but it depends on Criterion. Since Criterion is not on Crates, I'm unable to publish because cargo requires all dependencies to come from the same source (e.g. either all from git, or all from crates.io).

Would be fantastic if Criterion could get published, to make it easier to use in various projects!

Panic on performance regression

At line 54 of src/analysis/compare.rs we have

    let different_mean = t_test(id, avg_times, base_avg_times, criterion);
    let regressed = estimates(id, avg_times, base_avg_times, criterion);

    if different_mean && regressed.into_iter().all(|x| x) {
        panic!("{} has regressed", id);
    }

A performance regression on a benchmark probably shouldn't be a panic. Issue #51 may be relevant.

Also, there's probably no reason to call estimates() if different_mean is false; suspect this is a laziness idiom left over from the Haskell original.
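
A sketch of the suggested change, short-circuiting the estimates() call and reporting the regression instead of panicking:

    let different_mean = t_test(id, avg_times, base_avg_times, criterion);

    // Only compute the estimates when the t-test indicates a different mean,
    // and report the regression rather than panicking.
    if different_mean
        && estimates(id, avg_times, base_avg_times, criterion)
            .into_iter()
            .all(|x| x)
    {
        println!("{} has regressed", id);
    }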

Profile memory usage

For Rust, perhaps this information can be extracted from jemalloc.
For external programs, I don't think there is a direct way (i.e. ask the language runtime) to extract this information.

Questions:

  • What tool should be used? valgrind?
  • Should profiling be performed repeatedly to check whether the memory usage is constant or not? (GCed languages probably won't use constant memory on each run.)

Builder pattern for Criterion

The Criterion struct has grown a bunch of methods to benchmark one/many input(s) over one/many function(s)/program(s), and the method names look like bench_program_over_inputs, which is too long for my liking. This situation can be simplified by converting Criterion into a builder struct; that lets us combine a few methods to generate all the previous outcomes. So far I've got something like this working:

Criterion::default()
    .measurement_time(Duration::from_secs(3))
    .inputs(vec![1024, 32 * 1024, 1024 * 1024])
    .function(Function("alloc", |b, i| {}))
    .bench()
/* Output
Benchmarking alloc with input 1024
Benchmarking alloc with input 32768
Benchmarking alloc with input 1048576
*/
    .inputs(vec![7, 11, 13])
    .functions(vec![Function("par_fib", |b, i| {}), Function("seq_fib", |b, _| {})])
    .bench()
/* Output
Benchmarking par_fib with input 7
Benchmarking par_fib with input 11
Benchmarking par_fib with input 13
Benchmarking seq_fib with input 7
Benchmarking seq_fib with input 11
Benchmarking seq_fib with input 13
*/
    .function(Function("no_input", |b, &()| {}))
    //^ FIXME second argument of the closure is unnecessary
    .bench()
/* Output
Benchmarking no_input
*/
    .functions(vec![Function("A", |b, &()| {}), Function("B", |b, &()| {})])
    .bench();
/* Output
Benchmarking A
Benchmarking B
*/

A minimal PoC has been implemented in the criterion-builder branch. The PoC contains just the builder; no benchmarking is actually performed when you call bench.

I wonder if we can do something similar to the Bencher struct which has also grown quite a few methods for the different timing loops.

cc @faern

MultiModifier

FYI I upgraded to rust 1.0 and found this:

macros/src/lib.rs:21:9: 21:17 error: use of deprecated item: replaced by MultiModifier, #[deny(deprecated)] on by default
/git/checkouts/criterion.rs-b221423a0cb2a2ed/master/macros/src/lib.rs:21 Modifier(Box::new(expand_meta_criterion)));
Using #[allow(deprecated)] makes it compile atm.

Support per-iteration setup

For example, constructing a buffer that will be consumed by the iteration. This needs to not be counted in the elapsed time, but it's hard to do that without per-iteration timing overhead.

I opened this as a ticket for libtest's benchmarking (rust-lang/rust#18043) but if criterion.rs added support first, I would switch for sure.
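
For reference, criterion later added an iter_batched timing loop that covers this case; a sketch of the usage (the consume function and buffer size are placeholders):

use criterion::{BatchSize, Criterion};

// `consume` stands in for a routine that takes ownership of its input buffer.
fn consume(buf: Vec<u8>) -> usize {
    buf.len()
}

fn bench_consume(c: &mut Criterion) {
    c.bench_function("consume_buffer", |b| {
        b.iter_batched(
            || vec![0u8; 4096],   // per-iteration setup, excluded from the timing
            |buf| consume(buf),   // only this part is timed
            BatchSize::SmallInput,
        )
    });
}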

restarting development and criterion's future

I'd like to get criterion back into a usable state, and I'd like to improve the
user experience around it.

Here's my plan so far; input is appreciated and PRs are also welcome.

Immediate actions: criterion 0.1.0

Basically: release criterion with its current public API and refactor its internals; further improvements shouldn't break the API.

  • land new timing loops #26
  • make compilable on current nightly
  • refactor: use the new Instant/Duration APIs (when stable?)
  • fix travis
  • CI: test on mac. see #70
  • CI: test on windows. see #70
  • documentation: explain output/plots, add examples, how to use in a cargo
    project
  • fix AppVeyor
  • release dependencies
  • start making releases.

Short term: criterion 0.1.x

  • refactor: reduce usage of unstable features.
    • Throw away simd or put it behind a cargo feature; it's unstable and only
      works on x86_64
    • criterion won't work on stable anytime soon because it depends on
      test::black_box
  • colorize output messages
  • developer documentation: explain the math
  • refactor: proper error handling in internals. bubble up errors and report them instead of panicking/unwrapping
  • clippy
  • rustfmt

Future: criterion 0.2

If we are allowed breaking changes, how can we improve the user experience? Some
ideas:

  • simplify the public API
    • #26 adds several methods, perhaps all the similarly named methods can be
      replaced with a function that takes enums or some builder pattern.
  • expose errors to user instead of panicking/exiting
  • expose more internals, e.g. return an intermediate struct with the benchmark results
    instead of directly writing the results to disk.
  • move away from gnuplot, produce a web report with interactive plots
  • better integration with cargo, output files to target directory, plots
    could live in target/doc

If you used criterion in the past, I'd like to hear about your experience.

  • In general, what worked well? And what didn't work for you?
  • How helpful/clear/confusing were the generated plots/the output message?
  • Was the performance regression detection reliable, or did you get false
    positives?
  • What functionality do you think is missing?
  • Any suggestion to improve the user experience?

Add code coverage

Set up Coveralls.io for this repository and add code coverage tracking to the Travis build.

Replace rustc_serialize with Serde

The rustc_serialize crate is officially deprecated in favor of Serde. We should update the file system code to use Serde.

Difficulty: Easy
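
A minimal sketch of what the Serde-based serialization might look like, assuming serde with the derive feature plus serde_json; the struct and its fields are illustrative, not criterion's actual types:

use serde::{Deserialize, Serialize};

// The rustc_serialize derives would be swapped for Serde's derives.
#[derive(Serialize, Deserialize)]
struct Estimate {
    point_estimate: f64,
    standard_error: f64,
    confidence_interval: (f64, f64),
}

fn save(path: &std::path::Path, estimate: &Estimate) -> std::io::Result<()> {
    let json = serde_json::to_string_pretty(estimate).expect("serialization failed");
    std::fs::write(path, json)
}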

Replace Floaty with num-traits

Floaty doesn't seem to be maintained anymore. num-traits seems to be the standard for this sort of thing. I think it should just be a matter of updating all of the trait bounds and imports in stats.

Difficulty: Easy
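
A minimal sketch of the kind of change involved, assuming num-traits' Float trait; the mean function is just an example of a generic bound in stats:

use num_traits::Float;

// Trait bounds in the stats module would move from Floaty to num-traits' Float.
fn mean<T: Float>(sample: &[T]) -> T {
    let sum = sample.iter().fold(T::zero(), |acc, &x| acc + x);
    sum / T::from(sample.len()).expect("length representable as T")
}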

Add examples to doc home page

  • An example of what the code looks like, so I know how different it would be from test::Bencher.
  • An example of what the output looks like, so I know what I gain from switching.

Work with stable

If the only blocker is black_box, then perhaps for functions that return certain types we could have alternative mechanisms? E.g., calling functions with different inputs and collecting the output with a summation?
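
One possible stable-compatible black_box (an assumption, not necessarily what criterion would adopt) forces the value through a volatile read:

// Works on stable Rust; defeats most, though not all, compiler optimizations.
pub fn black_box<T>(dummy: T) -> T {
    unsafe {
        let ret = std::ptr::read_volatile(&dummy);
        std::mem::forget(dummy);
        ret
    }
}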

Roadmap

Subject to change without prior notice

Functionality:

  • Classification of outliers
  • Draw bootstrap distribution of statistics (mean,median,SD,MAD)
  • Draw bootstrap distribution of the relative change of the {mean,median,slope}
  • Check if the relative change of the (mean,median) is "significant"
    • Add threshold noise (percentage) field to the Criterion struct
    • If the whole confidence interval is below minus threshold noise,
      there has been an improvement
    • If the whole confidence interval is above plus threshold noise,
      there has been a regression
    • fail! if both the mean and median have regressed
  • Benchmark external programs
    • Decide on the interface that the external program must implement
      • for each line in stdin; niters = parse line; print ns_elapsed for niters (see the sketch after this list)
    • Hack up an implementation
    • Refactor/DRY the implementation
  • Summarize a bench group
    • Plot {mean,median,slope} vs input + error bars (ci)
      • add error bars for confidence intervals
      • automagically trigger log scale when appropriate
    • Draw a trendline
      • Show the formula of the trendline + R^2
      • Draw the confidence interval of the trendline
  • Implement #[criterion] with procedural macros
    • Try to replicate functionality of the --bench flag
    • Or hide the benches as tests behind #[cfg(bench)]
  • Use euler problems as a testing ground
  • Document the analysis process
  • Collect sample with a different number of iterations on each measurement
    • Number of iterations follow an arithmetic progression
    • Store iteration/total time pairs
  • Linear regression
  • Factor out statistics related code to a library
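
A sketch of the external-program protocol mentioned above (read an iteration count per line from stdin, reply with the elapsed nanoseconds), with routine standing in for the code under test:

use std::io::{self, BufRead, Write};
use std::time::Instant;

fn routine() {
    // code under test goes here
}

fn main() {
    let stdin = io::stdin();
    let stdout = io::stdout();
    let mut out = stdout.lock();

    // Protocol sketch: the harness writes one iteration count per line on stdin;
    // the program replies with the elapsed nanoseconds for that many iterations.
    for line in stdin.lock().lines() {
        let niters: u64 = line.unwrap().trim().parse().unwrap();
        let start = Instant::now();
        for _ in 0..niters {
            routine();
        }
        writeln!(out, "{}", start.elapsed().as_nanos()).unwrap();
    }
}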

Directory hierarchy:

.criterion
`-- fib
    |-- 10
    |   |-- base
    |   |   |-- MAD.svg
    |   |   |-- SD.svg
    |   |   |-- estimates.json
    |   |   |-- mean.svg
    |   |   |-- median.svg
    |   |   |-- outliers.json
    |   |   |-- pdf.svg
    |   |   |-- regression.svg
    |   |   |-- sample.json
    |   |   `-- slope.svg
    |   |-- both
    |   |   |-- pdf.svg
    |   |   `-- regression.svg
    |   |-- change
    |   |   |-- estimates.json
    |   |   |-- mean.svg
    |   |   |-- median.svg
    |   |   `-- t-test.svg
    |   `-- new
    |       `-- (...)  # Same as `base`
    |-- {15,20,5}
    |   `-- (...)  # Same as `10`
    `-- summary
        |-- base
        |   |-- means.svg
        |   |-- medians.svg
        |   `-- slopes.svg
        `-- new
            `-- (...)  # Same as `base`

Plots/Data dumps:

  • sample.json: Iteration/total time pair stored as u64
  • estimates.json: Estimate of several statistics
  • pdf.svg
    • sample PDF
    • sample points
    • color points depending on their "outlierness"
    • fence lines
  • outliers.json: Store outlier count/fences
  • both/pdf.svg: base and new PDF on the same graph
  • both/regression.svg: base and new linear regressions on the same graph
  • {base,new}/{mean,median,SD,MAD}.svg
    • PDF of the bootstrap distribution
    • Colored CI area
    • Point estimate
  • change/{mean,median}.svg
    • PDF of the bootstrap distribution
    • Colored CI area
    • Point estimate
    • Colored threshold noise
  • summary/{base,new}/{mean,median}.svg
    • point estimates
    • error bars (CI)
    • trend line + formula

Also, plots must have:

  • Title
  • X/Y labels
  • properly scaled X/Y axis
  • Legend (if necessary)

Wishlist:

  • plots should use:
    • "nice" colors
    • "nice" line styles
    • A "nice" font
  • JSON files
    • should be "pretty" encoded

Support variant metrics - MB/s, items/s, etc

There are a lot of algorithms for which seconds/iteration can be easily transformed into a more useful metric.

For example, I'm writing a delta-compressor, and so MB/sec would be a more intuitive metric.

API-wise, it seems that this could be done by adding a mapping function to the Criterion pipeline. Something like:

Criterion::default()
    .with_metric(|time, test_data| { test_data.len() / time })
    .bench_function_over_inputs(|b, data| {...});
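
For reference, a sketch using the benchmark-group/Throughput API that criterion later added, rather than the with_metric mapping suggested above; the compress function is a placeholder:

use criterion::{Criterion, Throughput};

fn compress(data: &[u8]) -> Vec<u8> {
    // placeholder for the real delta-compressor
    data.to_vec()
}

fn bench_compress(c: &mut Criterion) {
    let data = vec![0u8; 1 << 20];
    let mut group = c.benchmark_group("delta_compress");
    // Declaring the number of bytes processed per iteration makes criterion
    // report throughput (e.g. MB/s) alongside the timing estimates.
    group.throughput(Throughput::Bytes(data.len() as u64));
    group.bench_function("1MiB", |b| b.iter(|| compress(&data)));
    group.finish();
}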

Fix CI Builds

The Travis-CI builds are broken for a variety of different reasons.

Document the analysis process

Explain how and why criterion is executing the benchmark.

The API docs (generated by rustdoc) are not the best place to do this, so perhaps rustbook can be used to generate curated docs.

Compare memory usage

Is comparing the memory usage of benches something that might come in the future? In some cases I want to compare both the time and the memory usage.

BufWriter example. Investigate under what conditions the benchmarks get optimized away

gist

On the first batch run: copy_nonoverlapping, unsafe_writer and buf_writer7 reported times in the picosecond range for the three inline cases.

Upon commenting out those benchmarks and re-running the whole suite, buf_writer_3, buf_writer_11, and std_buf_writer reported times in the picosecond range for the inline/inline(always) cases. Those three benchmarks reported times in the nanosecond range in the first batch run.

This behavior seems odd. Commenting out some benchmark shouldn't affect how the compiler optimizes the untouched ones.

Or it could be a bug in criterion.

Add Example CI Configurations

It would be useful for downstream projects to use Criterion.rs in their CI pipeline to detect regressions in pull requests and similar. Add documentation and recommendations on how that could be achieved.

cargo bench does not work

I'm getting this error:

benches/macro.rs:3:11: 3:27 error: can't find crate for `criterion_macros` [E0463]
benches/macro.rs:3 #![plugin(criterion_macros)]

(Commit 128d5e5).

Fix Flaky Stats Tests

univariate::kde::test::f32::integral
univariate::bootstrap::test::f64::two_sample

Possibly others.
