bheisler / criterion.rs

Statistics-driven benchmarking library for Rust

License: Apache License 2.0

Rust 84.64% Shell 0.03% Python 0.10% HTML 15.23%
benchmark criterion gnuplot rust statistics

criterion.rs's People

Contributors

alexbool, andrewbanchich, bheisler, bsteinb, damienstanton, eh2406, faern, gnzlbg, gz, heroickatora, humb1t, japaric, killercup, lemmih, mkantor, nbarrios1337, palfrey, pierrechevalier83, pseitz, rreverser, rukai, samueltardieu, sjackman, timmmm, toomanybees, torkleyy, tottoto, vks, waywardmonkeys

criterion.rs's Issues

Spawn a second task to monitor CPU temperature and CPU load

And use this information to smartly select warm-up and cooldown times. (?)


Last time I tried to monitor CPU load (using GNU time), I saw constant (100%) CPU usage, but I may have been measuring incorrectly (maybe the sampling interval was too large compared to the lifetime of the process).
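
As a starting point, here's a minimal sketch of a background monitoring task, assuming a Linux-only /proc/loadavg source for load (temperature would need a platform-specific source such as /sys/class/thermal, which is not shown):

use std::fs;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Spawn a thread that samples the 1-minute load average every 500 ms until
// asked to stop, then returns the collected samples.
fn spawn_load_monitor(stop: Arc<AtomicBool>) -> thread::JoinHandle<Vec<f64>> {
    thread::spawn(move || {
        let mut samples = Vec::new();
        while !stop.load(Ordering::Relaxed) {
            if let Ok(contents) = fs::read_to_string("/proc/loadavg") {
                if let Some(first) = contents.split_whitespace().next() {
                    if let Ok(load) = first.parse::<f64>() {
                        samples.push(load);
                    }
                }
            }
            thread::sleep(Duration::from_millis(500));
        }
        samples
    })
}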

Estimated time is totally wrong when using the new timing loops

The estimated time is wrong, so the benchmark takes way longer than the specified measurement time.

  • iter_with_large_setup: the estimated time doesn't include the large setup
  • iter_with_setup: the estimated time doesn't include the per iteration setup
  • iter_with_large_drop: the estimated time doesn't include the final destructor call

Add benchmark-main & related macros

The goal here is to allow basic usage of cargo bench without requiring the extra --test --nocapture --test-threads 1 arguments by disabling the default benchmarking harness.

See the bencher crate, which already does something similar.
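
For reference, a sketch of what the macro-based setup could look like, modeled on the bencher crate (criterion eventually shipped criterion_group!/criterion_main! along roughly these lines); the fibonacci function is just a placeholder:

use criterion::{criterion_group, criterion_main, Criterion};

fn fibonacci(n: u64) -> u64 {
    (1..=n).fold((0u64, 1u64), |(a, b), _| (b, a + b)).0
}

fn fib_benchmark(c: &mut Criterion) {
    c.bench_function("fib 20", |b| b.iter(|| fibonacci(20)));
}

// These macros generate a main(), so the default libtest harness can be
// disabled with `harness = false` in the [[bench]] section of Cargo.toml.
criterion_group!(benches, fib_benchmark);
criterion_main!(benches);

Together with harness = false in Cargo.toml, this removes the need for the extra cargo bench arguments.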

Consider collecting the output of the benchmarked routine (`|| -> T`) to avoid calling `T`'s destructor during the timing loop

Currently, the timing loop looks like this:

// Self = Bencher
fn iter<T>(&mut self, routine: || -> T) {
    self.ns_start = time::precise_time_ns();
    for _ in range(0, self.niters) {
        black_box(routine());
    }
    self.ns_stop = time::precise_time_ns();
}

If routine is e.g. || Vec::from_elem(1024, 0u8) (which returns Vec<u8>), then on each iteration a vector is constructed and destroyed, hence Criterion is measuring both operations. The user has no way of measuring only the construction operation with this timing loop.

If the measuring loop is changed to:

fn iter<T>(&mut self, routine: || -> T) {
    let niters = self.niters;
    let mut outputs = Vec::with_capacity(niters);

    self.ns_start = time::precise_time_ns();
    for _ in range(0, niters) {
        outputs.push(routine());
    }
    self.ns_stop = time::precise_time_ns();

    drop(outputs);
}

Now Criterion will measure only the vector construction; the vector destruction won't happen during the measurement loop. Instead, all the vectors will be destroyed at the end of iter's scope.

The user can still measure both construction and destruction with this new timing loop by changing routine to || { Vec::from_elem(1024, 0u8); }, which returns () (although some black_boxing may be required to prevent the compiler from optimizing the routine away).

The time model changes slightly from:

elapsed = precise_time_ns + niters * (routine + drop(T))

to:

elapsed = precise_time_ns + niters * (routine + Vec.push)

Criterion will now report/analyze the execution time of (routine + Vec.push).

I think it makes sense to provide both timing models, to let the users pick whatever works best for them.
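
For comparison, here's a sketch of how those two timing models look with the API criterion later grew (iter_with_large_drop keeps the destructor out of the timed loop; dropping inside the closure keeps it in). This is an illustration, not the implementation discussed above:

use criterion::{black_box, Criterion};

fn timing_models(c: &mut Criterion) {
    // Measure only construction: the Vecs are collected and dropped after
    // the timed loop ends.
    c.bench_function("vec_new_only", |b| {
        b.iter_with_large_drop(|| vec![0u8; 1024])
    });

    // Measure construction + destruction: the Vec is dropped inside the
    // closure, i.e. inside the timed loop.
    c.bench_function("vec_new_and_drop", |b| {
        b.iter(|| {
            let v = vec![0u8; 1024];
            black_box(&v);
        })
    });
}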

Add a way to programmatically get benchmark results

It would be nice if a benchmarking run could return the percentiles and other statistics about the benchmark. This could be useful for multiple purposes; here are some examples:

  • Programs doing some self-diagnosis on startup to calibrate performance settings (think of games finding optimal settings for your hardware automatically).
  • Writing automatic CI performance-monitoring services that test performance over commits and report to some system.
  • Library users who would like to draw their own plots or save the results in a format other than the one built into criterion.rs.

This might be related to #15; I'm not sure what that issue wants to accomplish exactly.

Expand Readme

Add more information to the readme file, including basic usage examples, project goals, etc.

Librarify the analysis process

Currently criterion always performs the analysis of the benchmark in the same way. Criterion needs to be librarified to let the users configure the analysis process.

Publish to Crates.io?

Hi! Criterion is a wonderful library, and I've been using it locally for a lot of benchmarking. The problem is that I have a crate I want to publish to Crates.io ... but it depends on Criterion. Since Criterion is not on Crates, I'm unable to publish because cargo requires all dependencies to come from the same source (e.g. either all from git, or all from crates.io).

Would be fantastic if Criterion could get published, to make it easier to use in various projects!

Panic on performance regression

At line 54 of src/analysis/compare.rs we have

    let different_mean = t_test(id, avg_times, base_avg_times, criterion);
    let regressed = estimates(id, avg_times, base_avg_times, criterion);

    if different_mean && regressed.into_iter().all(|x| x) {
        panic!("{} has regressed", id);
    }

A performance regression on a benchmark probably shouldn't be a panic. Issue #51 may be relevant.

Also, there's probably no reason to call estimates() if different_mean is false; suspect this is a laziness idiom left over from the Haskell original.
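
A sketch of the suggested change, short-circuiting the estimates() call and reporting the regression instead of panicking:

    let different_mean = t_test(id, avg_times, base_avg_times, criterion);

    // Only compute the estimates when the t-test indicates a different mean,
    // and report the regression rather than panicking.
    if different_mean
        && estimates(id, avg_times, base_avg_times, criterion)
            .into_iter()
            .all(|x| x)
    {
        println!("{} has regressed", id);
    }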

Profile memory usage

For Rust, perhaps this information can be extracted from jemalloc.
For external programs, I don't think there is a direct way (i.e. ask the language runtime) to extract this information.

Questions:

  • What tool should be used? valgrind?
  • Should profiling be performed repeatedly to check whether the memory usage is constant or not? (GCed languages probably won't use constant memory on each run.)

Builder pattern for Criterion

The Criterion struct has grown a bunch of methods to benchmark one/many input(s) over one/many function(s)/program(s), and the method names look like bench_program_over_inputs, which is too long for my liking. This situation can be simplified by converting Criterion into a builder struct; that lets us combine a few methods to generate all the previous outcomes. So far I've got something like this working:

Criterion::default()
    .measurement_time(Duration::from_secs(3))
    .inputs(vec![1024, 32 * 1024, 1024 * 1024])
    .function(Function("alloc", |b, i| {}))
    .bench()
/* Output
Benchmarking alloc with input 1024
Benchmarking alloc with input 32768
Benchmarking alloc with input 1048576
*/
    .inputs(vec![7, 11, 13])
    .functions(vec![Function("par_fib", |b, i| {}), Function("seq_fib", |b, _| {})])
    .bench()
/* Output
Benchmarking par_fib with input 7
Benchmarking par_fib with input 11
Benchmarking par_fib with input 13
Benchmarking seq_fib with input 7
Benchmarking seq_fib with input 11
Benchmarking seq_fib with input 13
*/
    .function(Function("no_input", |b, &()| {}))
    //^ FIXME second argument of the closure is unnecessary
    .bench()
/* Output
Benchmarking no_input
*/
    .functions(vec![Function("A", |b, &()| {}), Function("B", |b, &()| {})])
    .bench();
/* Output
Benchmarking A
Benchmarking B
*/

A minimal PoC has been implemented in the criterion-builder branch. The PoC contains just the builder; no benchmarking is actually performed when you call bench.

I wonder if we can do something similar to the Bencher struct which has also grown quite a few methods for the different timing loops.

cc @faern

MultiModifier

FYI I upgraded to rust 1.0 and found this:

macros/src/lib.rs:21:9: 21:17 error: use of deprecated item: replaced by MultiModifier, #[deny(deprecated)] on by default
/git/checkouts/criterion.rs-b221423a0cb2a2ed/master/macros/src/lib.rs:21 Modifier(Box::new(expand_meta_criterion)));
Using #[allow(deprecated)] makes it compile atm.

Support per-iteration setup

For example, constructing a buffer that will be consumed by the iteration. This needs to not be counted in the elapsed time, but it's hard to do that without per-iteration timing overhead.

I opened this as a ticket for libtest's benchmarking (rust-lang/rust#18043) but if criterion.rs added support first, I would switch for sure.
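
For reference, criterion later added an iter_batched timing loop that covers this case; a sketch of the usage (the consume function and buffer size are placeholders):

use criterion::{BatchSize, Criterion};

// `consume` stands in for a routine that takes ownership of its input buffer.
fn consume(buf: Vec<u8>) -> usize {
    buf.len()
}

fn bench_consume(c: &mut Criterion) {
    c.bench_function("consume_buffer", |b| {
        b.iter_batched(
            || vec![0u8; 4096],   // per-iteration setup, excluded from the timing
            |buf| consume(buf),   // only this part is timed
            BatchSize::SmallInput,
        )
    });
}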

restarting development and criterion's future

I'd like to get criterion back into a usable state, and I'd like to improve the
user experience around it.

Here's my plan so far; input is appreciated and PRs are also welcome.

Immediate actions: criterion 0.1.0

Basically: release criterion with its current public API and refactor its internals; further improvements shouldn't break the API.

  • land new timing loops #26
  • make compilable on current nightly
  • refactor: use the new Instant/Duration APIs (when stable?)
  • fix travis
  • CI: test on mac. see #70
  • CI: test on windows. see #70
  • documentation: explain output/plots, add examples, how to use in a cargo
    project
  • fix AppVeyor
  • release dependencies
  • start making releases.

Short term: criterion 0.1.x

  • refactor: reduce usage of unstable features.
    • Throw away simd or put it behind a cargo feature; it's unstable and only
      works on x86_64
    • criterion won't work on stable anytime soon because it depends on
      test::black_box
  • colorize output messages
  • developer documentation: explain the math
  • refactor: proper error handling in internals. bubble up errors and report them instead of panicking/unwrapping
  • clippy
  • rustfmt

Future: criterion 0.2

If we are allowed breaking changes, how can we improve the user experience? Some
ideas:

  • simplify the public API
    • #26 adds several methods, perhaps all the similarly named methods can be
      replaced with a function that takes enums or some builder pattern.
  • expose errors to user instead of panicking/exiting
  • expose more internals, e.g. return an intermediate struct with the benchmark results
    instead of directly writing the results to disk.
  • move away from gnuplot, produce a web report with interactive plots
  • better integration with cargo, output files to target directory, plots
    could live in target/doc

If you used criterion in the past, I'd like to hear about your experience.

  • In general, what worked well? And what didn't work for you?
  • How helpful/clear/confusing were the generated plots/the output message?
  • Was the performance regression detection reliable, or did you get false
    positives?
  • What functionality do you think is missing?
  • Any suggestion to improve the user experience?

Add code coverage

Set up Coveralls.io for this repository and add code coverage tracking to the Travis build.

Replace rustc_serialize with Serde

The rustc_serialize crate is officially deprecated in favor of Serde. We should update the file system code to use Serde.

Difficulty: Easy
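
A minimal sketch of what the Serde-based serialization might look like, assuming serde with the derive feature plus serde_json; the struct and its fields are illustrative, not criterion's actual types:

use serde::{Deserialize, Serialize};

// The rustc_serialize derives would be swapped for Serde's derives.
#[derive(Serialize, Deserialize)]
struct Estimate {
    point_estimate: f64,
    standard_error: f64,
    confidence_interval: (f64, f64),
}

fn save(path: &std::path::Path, estimate: &Estimate) -> std::io::Result<()> {
    let json = serde_json::to_string_pretty(estimate).expect("serialization failed");
    std::fs::write(path, json)
}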

Replace Floaty with num-traits

Floaty doesn't seem to be maintained anymore. num-traits seems to be the standard for this sort of thing. I think it should just be a matter of updating all of the trait bounds and imports in stats.

Difficulty: Easy
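
A minimal sketch of the kind of change involved, assuming num-traits' Float trait; the mean function is just an example of a generic bound in stats:

use num_traits::Float;

// Trait bounds in the stats module would move from Floaty to num-traits' Float.
fn mean<T: Float>(sample: &[T]) -> T {
    let sum = sample.iter().fold(T::zero(), |acc, &x| acc + x);
    sum / T::from(sample.len()).expect("length representable as T")
}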

Add examples to doc home page

  • An example of what the code looks like, so I know how different it would be from test::Bencher.
  • An example of what the output looks like, so I know what I gain from switching.

Work with stable

If the only blocker is black_box, then perhaps for functions that return certain types we could have alternative mechanisms? E.g., calling functions with different inputs and collecting the output with a summation?
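
One possible stable-compatible black_box (an assumption, not necessarily what criterion would adopt) forces the value through a volatile read:

// Works on stable Rust; defeats most, though not all, compiler optimizations.
pub fn black_box<T>(dummy: T) -> T {
    unsafe {
        let ret = std::ptr::read_volatile(&dummy);
        std::mem::forget(dummy);
        ret
    }
}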

Roadmap

Subject to change without prior notice

Functionality:

  • Classification of outliers
  • Draw bootstrap distribution of statistics (mean,median,SD,MAD)
  • Draw bootstrap distribution of the relative change of the {mean,median,slope}
  • Check if the relative change of the (mean,median) is "significant"
    • Add threshold noise (percentage) field to the Criterion struct
    • If the whole confidence interval is below minus threshold noise,
      there has been an improvement
    • If the whole confidence interval is above plus threshold noise,
      there has been a regression
    • fail! if both the mean and median have regressed
  • Benchmark external programs
    • Decide on the interface that the external program must implement
      • for each line in stdin; niters = parse line; print ns_elapsed for niters (see the sketch after this list)
    • Hack up an implementation
    • Refactor/DRY the implementation
  • Summarize a bench group
    • Plot {mean,median,slope} vs input + error bars (ci)
      • add error bars for confidence intervals
      • automagically trigger log scale when appropriate
    • Draw a trendline
      • Show the formula of the trendline + R^2
      • Draw the confidence interval of the trendline
  • Implement #[criterion] with procedural macros
    • Try to replicate functionality of the --bench flag
    • Or hide the benches as tests behind #[cfg(bench)]
  • Use euler problems as a testing ground
  • Document the analysis process
  • Collect sample with a different number of iterations on each measurement
    • Number of iterations follow an arithmetic progression
    • Store iteration/total time pairs
  • Linear regression
  • Factor out statistics related code to a library
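
A sketch of the external-program protocol mentioned above (read an iteration count per line from stdin, reply with the elapsed nanoseconds), with routine standing in for the code under test:

use std::io::{self, BufRead, Write};
use std::time::Instant;

fn routine() {
    // code under test goes here
}

fn main() {
    let stdin = io::stdin();
    let stdout = io::stdout();
    let mut out = stdout.lock();

    // Protocol sketch: the harness writes one iteration count per line on stdin;
    // the program replies with the elapsed nanoseconds for that many iterations.
    for line in stdin.lock().lines() {
        let niters: u64 = line.unwrap().trim().parse().unwrap();
        let start = Instant::now();
        for _ in 0..niters {
            routine();
        }
        writeln!(out, "{}", start.elapsed().as_nanos()).unwrap();
    }
}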

Directory hierarchy:

.criterion
`-- fib
    |-- 10
    |   |-- base
    |   |   |-- MAD.svg
    |   |   |-- SD.svg
    |   |   |-- estimates.json
    |   |   |-- mean.svg
    |   |   |-- median.svg
    |   |   |-- outliers.json
    |   |   |-- pdf.svg
    |   |   |-- regression.svg
    |   |   |-- sample.json
    |   |   `-- slope.svg
    |   |-- both
    |   |   |-- pdf.svg
    |   |   `-- regression.svg
    |   |-- change
    |   |   |-- estimates.json
    |   |   |-- mean.svg
    |   |   |-- median.svg
    |   |   `-- t-test.svg
    |   `-- new
    |       `-- (...)  # Same as `base`
    |-- {15,20,5}
    |   `-- (...)  # Same as `10`
    `-- summary
        |-- base
        |   |-- means.svg
        |   |-- medians.svg
        |   `-- slopes.svg
        `-- new
            `-- (...)  # Same as `base`

Plots/Data dumps:

  • sample.json: Iteration/total time pair stored as u64
  • estimates.json: Estimate of several statistics
  • pdf.svg
    • sample PDF
    • sample points
    • color points depending on their "outlierness"
    • fence lines
  • outliers.json: Store outlier count/fences
  • both/pdf.svg: base and new PDF on the same graph
  • both/regression.svg: base and new linear regressions on the same graph
  • {base,new}/{mean,median,SD,MAD}.svg
    • PDF of the bootstrap distribution
    • Colored CI area
    • Point estimate
  • change/{mean,median}.svg
    • PDF of the bootstrap distribution
    • Colored CI area
    • Point estimate
    • Colored threshold noise
  • summary/{base,new}/{mean,median}.svg
    • point estimates
    • error bars (CI)
    • trend line + formula

Also, plots must have:

  • Title
  • X/Y labels
  • properly scaled X/Y axis
  • Legend (if necessary)

Wishlist:

  • plots should use:
    • "nice" colors
    • "nice" line styles
    • A "nice" font
  • JSON files
    • should be "pretty" encoded

Support variant metrics - MB/s, items/s, etc

There are a lot of algorithms for which seconds/iteration can be easily transformed into a more useful metric.

For example, I'm writing a delta-compressor, and so MB/sec would be a more intuitive metric.

API-wise, it seems that this could be done by adding a mapping function to the Criterion pipeline. Something like:

Criterion::default()
    .with_metric(|time, test_data| { test_data.len() / time })
    .bench_function_over_inputs(|b, data| {...});
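
For reference, a sketch using the benchmark-group/Throughput API that criterion later added, rather than the with_metric mapping suggested above; the compress function is a placeholder:

use criterion::{Criterion, Throughput};

fn compress(data: &[u8]) -> Vec<u8> {
    // placeholder for the real delta-compressor
    data.to_vec()
}

fn bench_compress(c: &mut Criterion) {
    let data = vec![0u8; 1 << 20];
    let mut group = c.benchmark_group("delta_compress");
    // Declaring the number of bytes processed per iteration makes criterion
    // report throughput (e.g. MB/s) alongside the timing estimates.
    group.throughput(Throughput::Bytes(data.len() as u64));
    group.bench_function("1MiB", |b| b.iter(|| compress(&data)));
    group.finish();
}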

Fix CI Builds

The Travis-CI builds are broken for a variety of different reasons.

Document the analysis process

Explain how and why criterion is executing the benchmark.

The API docs (generated by rustdoc) are not the best place to do this, so perhaps rustbook can be used to generate curated docs.

Compare memory usage

Is comparing the memory usage of benches something that might come in the future? In some cases I want to compare both the time and the memory usage.

BufWriter example. Investigate under what conditions the benchmarks get optimized away

gist

On the first batch run: copy_nonoverlapping, unsafe_writer and buf_writer7 reported times in the picosecond range for the three inline cases.

Upon commenting out those benchmarks and re-running the whole suite, buf_writer_3, buf_writer_11, and std_buf_writer reported times in the picosecond range for the inline/inline(always) cases. Those three benchmarks reported times in the nanosecond range in the first batch run.

This behavior seems odd. Commenting out some benchmark shouldn't affect how the compiler optimizes the untouched ones.

Or it could be a bug in criterion.

Add Example CI Configurations

It would be useful for downstream projects to use Criterion.rs in their CI pipeline to detect regressions in pull requests and similar. Add documentation and recommendations on how that could be achieved.

cargo bench does not work

I'm getting this error:

benches/macro.rs:3:11: 3:27 error: can't find crate for `criterion_macros` [E0463]
benches/macro.rs:3 #![plugin(criterion_macros)]

(Commit 128d5e5).

Fix Flaky Stats Tests

univariate::kde::test::f32::integral
univariate::bootstrap::test::f64::two_sample

Possibly others.
