I've been thinking about how samples and weighted samples are represented in `stat`, and in `dist` in the `ConjugateUpdate` methods, and it seems like we might be able to do something interesting by providing a set of interfaces to represent different ways of providing sample data to measure.

Here's the rough outline:
```go
type Sampler interface {
	Sample() []float64
}

type WeightedSampler interface {
	Sampler
	Weights() []float64
}

type Sample []float64

func (s Sample) Sample() []float64 {
	return s
}

type WeightedSample struct {
	Sample
	w []float64
}

func NewWeightedSample(s Sample, w []float64) *WeightedSample {
	if len(s) != len(w) {
		panic("stat: sample and weight lengths do not match")
	}
	return &WeightedSample{s, w}
}

func (ws *WeightedSample) Weights() []float64 {
	return ws.w
}
```
And then collapse functions like `Mean` to take a `Sampler`, perform a type assertion to see whether it is a `WeightedSampler`, and proceed from there.
This also ties into iterative methods where the data cannot be represented in memory easily. We could provide:
```go
type SampleReader interface {
	SampleRead([]float64) int
}

// ... and an equivalent interface with weights.

type IterativeSample func([]float64) int

func (is IterativeSample) SampleRead(buf []float64) int {
	return is(buf)
}

// Sample reads the full sample into memory, for use in cases where a
// SampleReader path doesn't exist.
func (is IterativeSample) Sample() []float64 {
	buf := make([]float64, 1024)
	var res []float64
	for {
		n := is(buf)
		if n <= 0 {
			break
		}
		res = append(res, buf[:n]...)
	}
	return res
}
```
This would let functions like `Mean` read a portion of the sample at a time. There would be a corresponding `WeightedIterativeSampler`, and additional types for multivariate samples, which would deal primarily in matrices. There is still some planning to do to allow things like buffered reading and writing.
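As a sketch of the reading path, assuming the interfaces above, a streaming mean over a `SampleReader` might look like the following (`MeanReader` is a hypothetical name, and the source in `main` is a made-up data generator):

```go
package main

import "fmt"

type SampleReader interface {
	SampleRead([]float64) int
}

type IterativeSample func([]float64) int

func (is IterativeSample) SampleRead(buf []float64) int { return is(buf) }

// MeanReader computes the mean one buffer at a time, so the full
// sample never needs to be held in memory at once.
func MeanReader(r SampleReader) float64 {
	buf := make([]float64, 1024)
	var sum float64
	var count int
	for {
		n := r.SampleRead(buf)
		if n <= 0 {
			break
		}
		for _, v := range buf[:n] {
			sum += v
		}
		count += n
	}
	return sum / float64(count)
}

func main() {
	// A hypothetical source that yields 1..10 and then reports exhaustion.
	next := 1.0
	src := IterativeSample(func(buf []float64) int {
		n := 0
		for n < len(buf) && next <= 10 {
			buf[n] = next
			next++
			n++
		}
		return n
	})
	fmt.Println(MeanReader(src)) // 5.5
}
```

The same pattern would work for any measurement that can be accumulated incrementally; measurements that need the whole sample at once would fall back to the `Sample()` path.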
The same types would be used in `dist` for `ConjugateUpdate` and `Rand`, and equivalent ones would exist in a multivariate version of `dist`.
Online versions of the various statistical measurements would take only a `SampleReader` or `WeightedSampleReader`, and would have "memory" so that they could produce updated measurements on demand. Parallel methods could use several goroutines to call `SampleRead`, and then combine the results at the end.
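For the "memory" and combine-at-the-end ideas, a minimal sketch might be a type that accumulates sufficient statistics and supports merging, so that partial results built by separate goroutines can be combined (`OnlineMean`, `Accumulate`, and `Merge` are hypothetical names, not existing API):

```go
package main

import "fmt"

// OnlineMean accumulates observations and can report the current mean
// on demand. Two OnlineMeans filled independently (e.g. by separate
// goroutines reading different parts of a sample) can be merged.
type OnlineMean struct {
	sum   float64
	count int
}

// Accumulate folds a batch of observations into the running state.
func (m *OnlineMean) Accumulate(xs []float64) {
	for _, v := range xs {
		m.sum += v
	}
	m.count += len(xs)
}

// Merge combines the state of another partial computation into m.
func (m *OnlineMean) Merge(o *OnlineMean) {
	m.sum += o.sum
	m.count += o.count
}

// Mean reports the mean of everything accumulated so far.
func (m *OnlineMean) Mean() float64 {
	return m.sum / float64(m.count)
}

func main() {
	var a, b OnlineMean
	a.Accumulate([]float64{1, 2, 3})
	b.Accumulate([]float64{4, 5})
	a.Merge(&b)
	fmt.Println(a.Mean()) // 3
}
```

Measurements whose state merges this way (mean, variance via sum-of-squares, moments) parallelize naturally; ones that don't would simply not offer a parallel path.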
Any thoughts on these changes? The immediate benefit would be to simplify the signatures for functions on samples of data, and to provide a way for users to handle very large amounts of data.
edit: fix go code