GithubHelp home page GithubHelp logo

Comments (4)

veprbl avatar veprbl commented on May 19, 2024

It would be an overstatement to say that it completely doesn't work:

std::vector<T> output;
std:partial_sum(axis.begin(), axis.end(), std::back_inserter(output), [] (const pair &p1, const pair &p2) { return p1.second + p2.second; });

Yes, it is less convenient, and one could provide an interface for iterating only over the values.

I see that the real problem is that example code inserts values back into the original histogram. This really feels like what ROOT users normally do. The ROOT developers and community have implemented all imaginable operations on histograms, so it made it very tempting to use them as you normally use, say, numpy arrays (I call this a histogram driven data analysis). I can imagine I don't need to tell how ugly it gets. And it applies to this case too. The cumulative distribution is not a histogram, it has a different meaning, you won't be able to safely manipulate it with routines designed to handle histograms (e.g. think how the bin statistics are not independent now). So my point is that a good histogram object should provide a means to reduce it to some other kind of value (a single number, or vector, or matrix, graph, ...), but list of operations allowed to produce a histogram should be limited to the ones that actually produce a sensible histogram.

from histogram.

HDembinski avatar HDembinski commented on May 19, 2024

I see that the real problem is that example code inserts values back into the original histogram.

I absolutely agree with you. A histogram is not a general containers of numbers, the values inside have a strong meaning. Giving write-access to the values would potentially violate the invariants of the class, notably the consistency of the variance estimates.

I will quote my answer to the boost mailing list. tl;dr: I think I should provide a conversion to Boost.MultiArray. Users can then use iterators of multi_array. The interface could look like this:

auto h3 = /* some 3d histogram */
// return copy of internal values as 3d multi_array, using value_type of histogram
// (which derives from storage policy); throws if dimension is wrong
auto array1 = h3.values<3>();
// return copy of internal values using custom value type int, throws if dimension is wrong
auto array2 = h3.values<3, int>();
// for experts, try to return a read-only view of internal values as
// const_multi_array_ref<3, int>; throws exception, if counts cannot be viewed like this;
// maybe use boost::optional?
auto array_view = histogram.values_view<3, int>();

I think we need several iterators then. The current axis iterators iterate over the bins definition, not really over the bin content of the histogram itself. I chose the current iterators in this way, because I believe this scheme generalises well to the multi-dimensional case. I wouldn't replace these iterators, because they seem useful to me, but additional iterators that iterate over the bin values and play well with algorithms are also clearly needed.

How should we define such iterators so that they generalise to multiple dimensions? For example, it would be easy to implement new iterators which just iterates over the bin content:

auto h = make_static_histogram(axis::regular(3, 0, 1));
// fill […], then
std::vector cumulative;
std::partial_sum(h.begin(), h.end(), cumulative.begin(), std::back_inserter(cumulative));

Questions:

  • What should this do if h is 2d? The same? If yes, in which order should the 2d array be iterated over?

The current design keeps the internal order an implementation detail (I use column-major internally, because it is easier to compute the strides on the fly), the two choices are fortran-like (column-major) or C-like (row-major). Perhaps provide both with clear names so it is obvious? Or just row-major, because that's the standard in C++? These questions must have been already discussed during the design of Boost.MultiArray. I think I should provide a way to provide a MultiArray copy of the bin content, thus leveraging all the work that already went into defining MultiArray. Similarly one could provide a read-only view of the variance estimates. A read-only MultiArray view of the internal bin counts of the histogram is also possible for array_storage policy, but not for the default adaptive_storage policy.

  • Writable iterators?
    In your example, you use writable iterators, to change the content (and the meaning of the content) of the histogram itself. In my reply here, I use read-only iterators, because changing the content of the histogram seems to go against the semantics. A histogram is more specific than a general container of values, the content has a strong meaning that cannot be arbitrarily changed. To be precise, what should the variance estimate, returned by histogram::variance(…) be for a cumulative count? The variance estimate for cumulative count is a whole matrix of covariances, since the successive entries are not independent. There are other operations where it is not clearly defined how the variance estimate should be computed. I think the best way to avoid these problems is to provide read-only iterators.

from histogram.

HDembinski avatar HDembinski commented on May 19, 2024

Ok, I thought about it some more. Returning a Boost.MultiArray does not work very well for several reasons. I don't want to return a copy, because the copy could be very large. Returning a view is difficult. Manually specifiying the histogram dimension in the call is also annoying.

And if you just want to iterate over the values, it should be possible, so I am thinking of adding iterators for that. The code should then look like this:

auto h = /* some histogram */
std::vector<double> csum;
std::partial_sum(h.values().begin(), h.values().end(), std::back_inserter(csum));

The iterator just goes over all normal bins, excluding under/overflow bins in implementation-defined order. For 1d-histograms it would just work, and if you need to compute something multi-dimensional, you will be able to query the index of the current element. Like so:

auto h = /* some 2d histogram */
for (auto iter = h.values().begin(); iter != h.values().end(); ++iter) {
   std::cout << iter.idx(0) << " " << iter.idx(1) << " " << *iter << std::endl;
}
// prints:
// 0 0 3.0
// 1 0 2.0
// ...

from histogram.

HDembinski avatar HDembinski commented on May 19, 2024

has been fixed mean-while in the master

from histogram.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.