bheisler / criterion.rs
Statistics-driven benchmarking library for Rust
License: Apache License 2.0
Use this information to smartly select warm-up and cooldown times. (?)
Last time I tried to monitor CPU load (using GNU time), I saw constant (100%) CPU usage, but I may have been measuring incorrectly (perhaps the sampling interval was too large compared to the lifetime of the process).
The estimated time is wrong, so the benchmark takes way longer than the specified measurement time.
- `iter_with_large_setup`: the estimated time doesn't include the large setup
- `iter_with_setup`: the estimated time doesn't include the per-iteration setup
- `iter_with_large_drop`: the estimated time doesn't include the final destructor call
The goal here is to allow basic usage of `cargo bench` without requiring the extra --test --nocapture --test-threads 1 arguments by disabling the default benchmarking harness.
See the bencher crate, which already does something similar.
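For context, a minimal sketch of how a bencher-style setup removes those flags; the Cargo.toml stanza is real Cargo syntax, but the file and function names are illustrative of the approach, not criterion's eventual API:

// In Cargo.toml, the bench target opts out of the default libtest harness:
//
//     [[bench]]
//     name = "my_benches"
//     harness = false
//
// benches/my_benches.rs then provides its own main, so a plain
// `cargo bench` runs it without --test --nocapture --test-threads 1.

fn bench_fib() {
    // benchmark body would go here
}

fn main() {
    // A custom runner calls each benchmark sequentially and prints output
    // directly, since libtest is no longer capturing stdout.
    bench_fib();
}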
Currently, the timing loop looks like this:
// Self = Bencher
fn iter<T>(&mut self, routine: || -> T) {
    self.ns_start = time::precise_time_ns();
    for _ in range(0, self.niters) {
        black_box(routine());
    }
    self.ns_stop = time::precise_time_ns();
}
If routine is e.g. `|| Vec::from_elem(1024, 0u8)` (which returns `Vec<u8>`), then on each iteration a vector is constructed and destroyed, hence Criterion is measuring both operations. The user has no way of measuring only the construction operation with this timing loop.
If the measuring loop is changed to:
fn iter<T>(&mut self, routine: || -> T) {
    let niters = self.niters;
    let mut outputs = Vec::with_capacity(niters);
    self.ns_start = time::precise_time_ns();
    for _ in range(0, niters) {
        outputs.push(routine());
    }
    self.ns_stop = time::precise_time_ns();
    // The outputs are destroyed here, outside the timed region.
    drop(outputs);
}
Now Criterion will measure only the vector construction, and the vector destruction won't happen during the measurement loop; instead, all the vectors will be destroyed at the end of `iter`'s scope.
The user can still measure both construction and destruction using this new timing loop by changing routine to `|| { Vec::from_elem(1024, 0u8); }`, which returns `()` (although some `black_box`ing may be required to prevent the compiler from optimizing the routine away).
The time model changes slightly from:
elapsed = precise_time_ns + niters * (routine + drop(T))
to:
elapsed = precise_time_ns + niters * (routine + Vec.push)
Criterion will now report/analyze the execution time of `(routine + Vec.push)`.
I think it makes sense to provide both timing models, to let users pick whichever works best for them.
It would be nice if a benchmarking run could return the percentiles and other statistics about the benchmark. It could be useful for multiple purposes, here are some examples:
This might be related to #15; I'm not sure what that issue wants to accomplish exactly.
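As a sketch of what "returning statistics" could look like, here is a hypothetical Summary type with a percentile accessor; none of these names exist in criterion, they only illustrate the request:

// Hypothetical return type for a benchmark run; all names are
// illustrative, not part of criterion's actual API.
pub struct Summary {
    samples: Vec<f64>, // measured times in ns, sorted ascending
}

impl Summary {
    /// Linear-interpolation percentile, p in [0, 100].
    pub fn percentile(&self, p: f64) -> f64 {
        let n = self.samples.len();
        assert!(n > 0 && (0.0..=100.0).contains(&p));
        let rank = p / 100.0 * (n - 1) as f64;
        let lo = rank.floor() as usize;
        let hi = rank.ceil() as usize;
        let frac = rank - lo as f64;
        self.samples[lo] + frac * (self.samples[hi] - self.samples[lo])
    }

    pub fn median(&self) -> f64 {
        self.percentile(50.0)
    }
}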
I'm missing some documentation on, say, migrating a "standard" benchmark to a criterion benchmark.
Add more information to the readme file, including basic usage examples, project goals, etc.
Currently, criterion always performs the analysis of the benchmark in the same way. Criterion needs to be librarified to let users configure the analysis process.
Hi! Criterion is a wonderful library, and I've been using it locally for a lot of benchmarking. The problem is that I have a crate I want to publish to crates.io ... but it depends on Criterion. Since Criterion is not on crates.io, I'm unable to publish, because cargo requires all dependencies to come from the same source (e.g. either all from git, or all from crates.io).
Would be fantastic if Criterion could get published, to make it easier to use in various projects!
At line 54 of `src/analysis/compare.rs` we have:
let different_mean = t_test(id, avg_times, base_avg_times, criterion);
let regressed = estimates(id, avg_times, base_avg_times, criterion);
if different_mean && regressed.into_iter().all(|x| x) {
    panic!("{} has regressed", id);
}
A performance regression on a benchmark probably shouldn't be a panic. Issue #51 may be relevant.
Also, there's probably no reason to call `estimates()` if `different_mean` is false. I suspect this is a laziness thing left over from the Haskell.
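A sketch of the suggested restructuring, short-circuiting the estimates computation and reporting instead of panicking (report_regression is a hypothetical helper, not an existing function):

let different_mean = t_test(id, avg_times, base_avg_times, criterion);
// Only compute the (potentially expensive) estimates when the t-test
// already indicates a different mean.
if different_mean {
    let regressed = estimates(id, avg_times, base_avg_times, criterion);
    if regressed.into_iter().all(|x| x) {
        // Report instead of panicking; a regression is a result,
        // not a crash. `report_regression` is hypothetical.
        report_regression(id);
    }
}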
Maybe I missed this in the docs but how can I change the number of samples from the default 100?
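Answering with a sketch: assuming the Criterion builder exposes a sample-size setter (later versions of criterion.rs call it sample_size; check the version you're using), it would look like:

// Request 500 samples instead of the default 100.
let c = Criterion::default().sample_size(500);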
Combine the stats.rs and simplot.rs repositories into this one using cargo workspaces.
For Rust, perhaps this information can be extracted from jemalloc.
For external programs, I don't think there is a direct way (i.e. ask the language runtime) to extract this information.
Questions:
The `Criterion` struct has grown a bunch of methods to benchmark one/many input(s) over one/many function(s)/program(s), and the method names look like `bench_program_over_inputs`, which is too long for my liking. This situation can be simplified by converting `Criterion` into a builder struct; this lets us combine a few methods that generate all the previous outcomes. So far I've got something like this working:
Criterion::default()
    .measurement_time(Duration::from_secs(3))
    .inputs(vec![1024, 32 * 1024, 1024 * 1024])
    .function(Function("alloc", |b, i| {}))
    .bench()
    /* Output
    Benchmarking alloc with input 1024
    Benchmarking alloc with input 32768
    Benchmarking alloc with input 1048576
    */
    .inputs(vec![7, 11, 13])
    .functions(vec![Function("par_fib", |b, i| {}), Function("seq_fib", |b, _| {})])
    .bench()
    /* Output
    Benchmarking par_fib with input 7
    Benchmarking par_fib with input 11
    Benchmarking par_fib with input 13
    Benchmarking seq_fib with input 7
    Benchmarking seq_fib with input 11
    Benchmarking seq_fib with input 13
    */
    .function(Function("no_input", |b, &()| {}))
    //^ FIXME second argument of the closure is unnecessary
    .bench()
    /* Output
    Benchmarking no_input
    */
    .functions(vec![Function("A", |b, &()| {}), Function("B", |b, &()| {})])
    .bench();
    /* Output
    Benchmarking A
    Benchmarking B
    */
A minimal PoC has been implemented in the criterion-builder branch. The PoC contains just the builder; no benchmarking is actually performed when you call `bench`.
I wonder if we can do something similar to the `Bencher` struct, which has also grown quite a few methods for the different timing loops.
cc @faern
FYI I upgraded to rust 1.0 and found this:
macros/src/lib.rs:21:9: 21:17 error: use of deprecated item: replaced by MultiModifier, #[deny(deprecated)] on by default
/git/checkouts/criterion.rs-b221423a0cb2a2ed/master/macros/src/lib.rs:21 Modifier(Box::new(expand_meta_criterion)));
Using #[allow(deprecated)] makes it compile atm.
For example, constructing a buffer that will be consumed by the iteration. This needs to not be counted in the elapsed time, but it's hard to do that without per-iteration timing overhead.
I opened this as a ticket for libtest's benchmarking (rust-lang/rust#18043), but if criterion.rs added support first, I would switch for sure.
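A sketch of one way to keep per-iteration setup out of the timed region without timing each iteration individually: build all the inputs up front, then time only the loop that consumes them. This is my illustration, not criterion's API (later versions of criterion.rs address this with iter_batched):

use std::time::Instant;

fn bench_consume<I, T, S, R>(niters: usize, mut setup: S, mut routine: R) -> u64
where
    S: FnMut() -> I,
    R: FnMut(I) -> T,
{
    // Construct every input before starting the clock, so setup cost
    // never lands inside the measured interval.
    let inputs: Vec<I> = (0..niters).map(|_| setup()).collect();
    let start = Instant::now();
    for input in inputs {
        // The routine consumes a pre-built input. Note: the outputs'
        // destructors still run inside the timed loop; combine this with
        // the outputs-Vec trick above if drop time matters.
        let _ = routine(input);
    }
    start.elapsed().as_nanos() as u64
}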
I'd like to get criterion back into a usable state, and I'd like to improve the
user experience around it.
Here's my plan so far; input is appreciated and PRs are also welcome.
Basically: release criterion with its current public API, then refactor its internals; further improvements shouldn't break the API.
If we are allowed breaking changes, how can we improve the user experience? Some
ideas:
If you used criterion in the past, I'd like to hear about your experience.
Add a chapter to the book on how data is collected and analyzed.
Two of the new timing loops use a lot of memory. The memory usage can be limited at the cost of reducing the number of iterations, which reduces the measurement time and the precision of the measurement.
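To make that tradeoff concrete, a small sketch of capping the iteration count by a memory budget (the function name and the budget are illustrative):

use std::cmp;
use std::mem;

/// Cap the iteration count so the stored outputs stay under `budget_bytes`.
/// Note this counts only the inline size of T; heap allocations owned by T
/// (e.g. a Vec's buffer) are not included.
fn capped_niters<T>(target_niters: usize, budget_bytes: usize) -> usize {
    let size = cmp::max(1, mem::size_of::<T>());
    cmp::min(target_niters, budget_bytes / size)
}

// e.g. storing Vec<u8> handles (24 bytes each on 64-bit) under 256 MiB:
// let n = capped_niters::<Vec<u8>>(1_000_000, 256 * 1024 * 1024);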
When the input is an integer.
Bonus:
Set up Coveralls.io for this repository and add code coverage tracking to the Travis build.
The rustc_serialize crate is officially deprecated in favor of Serde. We should update the file system code to use Serde.
Difficulty: Easy
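For illustration, the change is roughly swapping the derive attributes and the serialization calls; a sketch assuming serde_json is used for the on-disk files (the struct here is a stand-in, not criterion's actual type):

// Before (rustc_serialize):
//   #[derive(RustcEncodable, RustcDecodable)]
// After (serde):
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct Estimates {
    mean: f64,
    median: f64,
}

// Reading/writing the JSON files then goes through serde_json:
// let json = serde_json::to_string(&estimates)?;
// let estimates: Estimates = serde_json::from_str(&json)?;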
Floaty doesn't seem to be maintained anymore. num-traits seems to be the standard for this sort of thing. I think it should just be a matter of updating all of the trait bounds and imports in stats.
Difficulty: Easy
They are already outdated.
`test::Bencher`.
Like `simplot`.
If the only blocker is black_box, then perhaps for functions that return certain types we could have alternative mechanisms? E.g., calling functions with different input and collecting the output with a summation?
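A sketch of that idea, summing the outputs so the compiler can't discard the calls (this is an illustration of the technique, not criterion's mechanism):

fn fib(n: u64) -> u64 {
    if n < 2 { n } else { fib(n - 1) + fib(n - 2) }
}

// Feed each call a different input and fold the results into an
// accumulator; because the final value is observed, the calls can't
// be optimized away even without black_box.
fn timed_sum(inputs: &[u64]) -> u64 {
    let mut acc = 0u64;
    for &n in inputs {
        acc = acc.wrapping_add(fib(n));
    }
    acc // must be used (printed/returned) by the caller
}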
Subject to change without prior notice
Functionality:
- `Criterion` struct
- `fail!` if both the mean and median have regressed
- `#[criterion]` with procedural macros
- `--bench` flag
- `[cfg(bench)]`
Directory hierarchy:
.criterion
`-- fib
|-- 10
| |-- base
| | |-- MAD.svg
| | |-- SD.svg
| | |-- estimates.json
| | |-- mean.svg
| | |-- median.svg
| | |-- outliers.json
| | |-- pdf.svg
| | |-- regression.svg
| | |-- sample.json
| | `-- slope.svg
| |-- both
| | |-- pdf.svg
| | `-- regression.svg
| |-- change
| | |-- estimates.json
| | |-- mean.svg
| | |-- median.svg
| | `-- t-test.svg
| `-- new
| `-- (...) # Same as `base`
|-- {15,20,5}
| `-- (...) # Same as `10`
`-- summary
|-- base
| |-- means.svg
| |-- medians.svg
| `-- slopes.svg
`-- new
`-- (...) # Same as `base`
Plots/Data dumps:
- `sample.json`: iteration/total time pairs stored as `u64`
- `estimates.json`: estimates of several statistics
- `pdf.svg`
- `outliers.json`: stores outlier counts/fences
- `both/pdf.svg`: base and new PDF on the same graph
- `both/regression.svg`: base and new linear regressions on the same graph
- `{base,new}/{mean,median,SD,MAD}.svg`
- `change/{mean,median}.svg`
- `summary/{base,new}/{mean,median}.svg`
Also, plots must have:
Wishlist:
Is there a way to use this library to investigate the dependence of runtime on input size? E.g. run benchmarks on different sizes of inputs.
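A sketch of how this could look with the bench_function_over_inputs method mentioned below (the signature is hedged from memory and may differ in your version): benchmark the same function once per input size and compare the reported times across sizes.

fn process(data: &[u8]) -> u64 {
    data.iter().map(|&b| b as u64).sum()
}

// One benchmark per input size; criterion reports each size separately,
// so the growth of runtime with input size can be read off the results.
Criterion::default().bench_function_over_inputs(
    "process",
    |b, &size| {
        let data = vec![0u8; size];
        b.iter(|| process(&data));
    },
    vec![1024, 32 * 1024, 1024 * 1024],
);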
There are a lot of algorithms for which seconds/iteration can be easily transformed into a more useful metric.
For example, I'm writing a delta-compressor, and so MB/sec would be a more graspable metric.
API-wise, it seems that this could be done by adding a mapping function to the Criterion pipeline. Something like:
Criterion::default()
.with_metric(|time, test_data| { test_data.len() / time })
.bench_function_over_inputs(|b, data| {...});
Use violin plots instead of boxplots.
Blocked on japaric-archived/simplot.rs#4
The candlestick should show Q1, Q2, Q3, Q1 - 1.5 * IQR, and Q3 + 1.5 * IQR.
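For concreteness, a minimal sketch of computing those five values from the quartiles (IQR = Q3 - Q1):

/// Candlestick values: (low fence, Q1, Q2, Q3, high fence),
/// where the fences are the usual Tukey fences at 1.5 * IQR.
fn candlestick(q1: f64, q2: f64, q3: f64) -> (f64, f64, f64, f64, f64) {
    let iqr = q3 - q1;
    (q1 - 1.5 * iqr, q1, q2, q3, q3 + 1.5 * iqr)
}

// e.g. candlestick(10.0, 12.0, 14.0) == (4.0, 10.0, 12.0, 14.0, 20.0)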
Bonus:
The Travis-CI builds are broken for a variety of different reasons.
If `--cfg criterion` is passed, the crate should be compiled into a "benchmark runner", where the functions marked as `#[criterion]` will be benchmarked sequentially.
libsyntax/test.rs seems to contain the implementation details of the `#[test]` syntax extension.
Also, set the optimization level of tests to -O3.
The labels are not useful, since the sample can be easily reclassified using the fences information contained in `outliers.json`.
Explain how and why criterion is executing the benchmark.
The API docs (generated by rustdoc) are not the best place to do this, so perhaps rustbook can be used to generate curated docs.
Is comparing the memory usage of benches something that might come in the future? In some cases I want to compare both the time and the memory usage.
On the first batch run, `copy_nonoverlapping`, `unsafe_writer`, and `buf_writer7` reported times in the picosecond range for the three `inline` cases.
Upon commenting out those benchmarks and re-running the whole suite, `buf_writer_3`, `buf_writer_11`, and `std_buf_writer` reported times in the picosecond range for the `inline`/`inline(always)` cases. Those three benchmarks reported times in the nanosecond range in the first batch run.
This behavior seems odd. Commenting out some benchmarks shouldn't affect how the compiler optimizes the untouched ones.
Or it could be a bug in criterion.
It would be useful for downstream projects to use Criterion.rs in their CI pipeline to detect regressions in pull requests and similar. Add documentation and recommendations on how that could be achieved.
I'm getting this error:
benches/macro.rs:3:11: 3:27 error: can't find crate for `criterion_macros` [E0463]
benches/macro.rs:3 #![plugin(criterion_macros)]
(Commit 128d5e5).
univariate::kde::test::f32::integral
univariate::bootstrap::test::f64::two_sample
Possibly others.