xoopr / distr6
R6 object-oriented interface for probability distributions.
Home Page: https://xoopr.github.io/distr6/
License: Other
Two options:
Option 2 seems better.
One of the most powerful features of distr is the ability to automatically generate the d/p/q/r functions if not all are supplied to a constructor. The question to consider now is whether the current generating functions for these can be brought forward to an R6 upgrade, or whether there are other methods for doing so.
As part of the design package I'm doing a feasibility sketch, i.e. checking whether our design choices are possible in R6, specifically the Decorator design pattern and multiple inheritance.
So far I have the following workaround for Decorators. In this way there are no concrete decorator classes that can be constructed; instead, functions in the constructor for the Distribution object add methods from decorators when requested. This has the following consequences:
If anyone has any comments on this let me know, if not I'll close this issue after a week and proceed with this style of implementation for decorators.
Let X be a distribution with pdf f_x and a support on the interval [a, b]. By definition, f_x(c) = 0 for c < a or c > b. But this can be implemented in one of two ways:
The difference comes down to deciding if the user should implement these checks in a custom distribution (i.e. via Distribution$new(...)) or if the API design should be such that a call to liesInSupport() is performed before calling the user-defined pdf/cdf.
I am leaning toward (1) as it means there is less room for user error, and it also simplifies the formulas required for wrappers such as truncation.
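As a minimal sketch of option (1), assuming hypothetical names (makeCheckedPdf is not part of the proposed API), the framework could wrap a user-supplied pdf so that values outside the support automatically return 0:

```r
# Hypothetical helper: wraps a user-supplied pdf with an automatic
# support check, so f_x(c) = 0 outside [lower, upper] without the user
# having to implement the check themselves.
makeCheckedPdf <- function(userPdf, lower, upper) {
  function(x) {
    inSupport <- x >= lower & x <= upper   # vectorised liesInSupport-style check
    out <- numeric(length(x))
    out[inSupport] <- userPdf(x[inSupport])
    out
  }
}

# A user defines the Uniform(0, 1) pdf without any boundary checks...
rawPdf <- function(x) rep(1, length(x))
# ...and the framework returns the checked version.
pdf <- makeCheckedPdf(rawPdf, 0, 1)
pdf(c(-0.5, 0.5, 1.5))   # 0 1 0
```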
In terms of the user experience I can see three options for convolution, and more generally any functions that generate a new distribution from two or more constructed ones:
By example,
Z = Exp$new(rate = 1) + Exp$new(rate = 1)
Then Z$pdf gives a function involving rate, which gives three options:
The first option is the least flexible but requires the least amount of work. Arguments for this option are a) that the user can re-call the convolution after updating the parameters of the distributions and b) strictly speaking convolution of distributions is dependent on parameters and therefore they should remain fixed. Arguments against this (apart from flexibility) are that the user may want to keep working with the original distributions but only change their parameters after the convolution.
The second and third options are both considerably more work but add greater flexibility for updating parameters at any time. There is a possible midpoint between the first and third options, in which there is a wrapper with access to the original distributions but the parameters cannot be edited after convolution.
Any thoughts anyone has on this are appreciated.
Just to add this as a discussion point:
for various reasons, one may like to have some more exotic functionals related to distributions, such as n-th (anti-)derivatives of pdf, lp-norm of the various distribution characterizing functions, or mgf/cgf.
It's reasonably easy to plan them in from the start if we wanted to, but given any design, one may ask the question:
Due to the way R6 classes are stored, functionality is not as it should be. For example, the following isn't currently possible: Normal$new() %>% decorate(CoreStatistics). Whilst N <- Normal$new(); N %>% decorate(CoreStatistics) does work, the message produced is incorrect: ". is now decorated with CoreStatistics".
Our last discussion concluded that conditional distributions should be implemented as wrappers. This seemed like the natural choice as wrappers inherit methods from DistributionWrapper
that get all wrapped models and overload any methods/variables as required. However we also concluded that joint distributions inherit from Distribution and are not wrappers. This seems counterintuitive as the two are naturally related.
I would argue that a better option may be to have constructors for JointDistribution
and ConditionalDistribution
with methods to get one from the other where appropriate/possible and exploiting properties of independence etc. where possible.
Thoughts?
After a day of going back and forth over the best way to juggle a clean interface, preventing problems with numerical calculations when analytic results are available, and dealing with imputation of p/d/q/r, I think I have come up with a good compromise.
integrate, .integrate and Convolution will be analytic by default but numeric with warning otherwise, i.e. these won't be two separate functions but will just use whatever is available. So if a user calls Convolution$new(Bernoulli, Bernoulli) then this will default to a Binomial, but if it is called on custom distributions, the result is computed numerically (with warning). Same goes for other wrappers.
The result of this is that not all Distributions will have the same methods available as part of the core functionality (not to be confused with the CoreStatistics functionality). For example, say that there is no analytic result for the kurtosis of an Exponential distribution (there is, it's 9); then the Binomial distribution core would have a kurtosis method whereas Exponential would not, but this would be available as a numeric result after calling CoreStatistics$new.
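A rough sketch of the numeric fallback for a discrete convolution (plain functions with hypothetical names, not the wrapper API itself): when no analytic result is registered, the pdf of X + Y can be computed by summing over the support.

```r
# Hypothetical numeric fallback: pdf of X + Y for discrete X, Y
# sharing a common support, by direct summation.
convolvePdf <- function(pdf1, pdf2, support) {
  function(z) sapply(z, function(zi) sum(pdf1(support) * pdf2(zi - support)))
}

# Bernoulli(0.5) + Bernoulli(0.5) recovers the analytic Binomial(2, 0.5)
dbern <- function(x) ifelse(x %in% c(0, 1), 0.5, 0)
dsum <- convolvePdf(dbern, dbern, support = 0:1)
dsum(0:2)            # 0.25 0.50 0.25
dbinom(0:2, 2, 0.5)  # matches the analytic shortcut
```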
Finally, in constructing a decorator, we include the parameter show.warnings which, if TRUE, means that a message (not a warning, as these stack and become incredibly frustrating) is produced saying that the result is approximate only, or, if FALSE, doesn't produce any warnings.
@fkiraly I think this may be in line with what you were originally thinking. If it is, sorry it took me so long to reach it!
Again, thoughts and comments appreciated.
The current suggestion for the print method is to have the familiar distribution name followed by parameters, e.g. Binomial(10, 0.2). However for summary() it may be sensible to have various arguments that determine how much information is provided; for example, the summary method could return all reasonable statistical functions of interest, e.g. mean, var, kurtosis. Alternatively, in the simplest case it could just provide the distribution along with named parameters and values.
Should these be implemented as:
i) Inherited classes
ii) Composite classes
iii) Wrappers (i.e. composite classes without being formally defined as a new class, identified via some sort of flag, e.g. Truncated=TRUE)
There are a number of decisions to make regarding how to name methods, I'm talking specifically about the actual method names and not the case they're written in. These decisions include:
i) whether S3 and R6 methods should be equivalent, e.g. D$kurtosis() == kurtosis(D)
ii) which name to choose, e.g. D$p(1) or D$pdf(1) or D$density(1); e.g.2 D$mean() or D$expectation() or D$E()
iii) whether multiple aliases should co-exist, e.g. D$p(1) and D$pdf(1) and D$density(1)
My personal opinion is that S3 and R6 methods should have the same name, but at times it may be worth introducing extra S3 methods to be in line with common functions. For example it may be worth having all of D$expectation(), D$mean(), expectation(D), mean(D); or, as pointed out by @PeterRuckdeschel, as expectation is more informative, perhaps only D$expectation(), expectation(D), mean(D), where the final mean is kept in line with common practices in R. Additionally I think it may be useful to have both pdf(D) and pmf(D), or density(D) and mass(D). Extra pointer functions are cheap for S3 methods, so I think by having only one representation for each in R6 we can then use S3 to define multiple names (although I understand from an abstraction perspective this may be a bad idea).
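To illustrate how cheap these S3 "pointer" methods are, a sketch (using a plain classed list as a stand-in for an R6 object):

```r
# A stand-in for an R6 distribution object: one canonical method...
D <- structure(list(expectation = function() 0.5), class = "Distribution")

# ...and several S3 names that all delegate to it.
expectation <- function(x, ...) UseMethod("expectation")
expectation.Distribution <- function(x, ...) x$expectation()
mean.Distribution <- function(x, ...) x$expectation()   # base::mean is already generic

c(expectation(D), mean(D))   # 0.5 0.5
```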
Have we decided how, or whether, to specify the domain in which the random variable can take values?
This may make sense in multiple ways:
(i) the set where the r.v. with the given parameters may take values
(ii) the set where a r.v. within the family may take values
(iii) the "smallest standard superset" of where it may take values, e.g., R, R-plus, N, Z, etc
(iv) a combination of the above
I think (iii) makes type checks easy in cases where you combine random variables arithmetically, e.g., if you want to ask if they are both non-negative real etc you don't care about the precise support.
Whereas (i) is useful if you want to infer bounds or create composites.
An obvious sub-question is if there should be multiple domain specifications - e.g., one of the precise support, one of a more generic "scientific type".
R62S3 is used in the zzz.r file. We could call the function either on loading of the package or on attaching. In the former case, functions are loaded into local memory when the package is installed and loaded. In the latter, functions are loaded into local memory on use of library(distr6). I think the latter makes more sense UNLESS there is some reason why someone may want to use the S3 dispatch via distr6::somefun(...), but I think that this already adds more cumbersome interfacing than would be desired for the simple S3 dispatch methods...
All equations in \eqn and \deqn should be in standard LaTeX format with correct escape sequences, Greek letters, etc. Need to check the pdf vignettes to verify these.
Given the problems with multiple inheritance and the relatively small number of methods that are actually defined in any one of the abstract classes for continuous/discrete/univariate etc., is there actually any reason to have these defined as separate classes?
I think we should consider simplifying the structure and opting for a behavioural pattern instead that checks the properties/traits and calls the relevant method (as opposed to identifying the class hierarchy and dispatching on this).
This reduces the number of classes and problems with multiple inheritance and will later help in extensibility/scalability.
Does anyone have any opinions against this?
Decide:
i) How best to separate 'properties' and 'traits', if indeed this separation is sensible
ii) What should be included in both
iii) If these should be defined as R6 classes or named lists
In short, multiple inheritance does not work in R6. I have also seen a lot of arguments against multiple inheritance in general and therefore wonder if we can find a different design pattern that may work. I suggest using a form of the Adapter pattern as follows:
Distribution <- the abstract parent, which then splits into ValueSupport and VariateForm
XvTDistribution <- Adapter that has one field: list(ValueSupport, VariateForm)
Pseudo-code:
XvTDistribution <- R6Class(public = list(Type = list(ValueSupport, VariateForm), density = function() {...}))
BinomialDistribution <- R6Class(inherit = XvTDistribution)
BinomialDistribution$new() %>% density()
# This calls the density function in the parent adapter 'XvTDistribution', which identifies
# the caller as Discrete and Univariate and adapts the call to the relevant density functions.
Truncated distributions are currently a wrapper, inheriting from the generic wrapper class, which inherits from Distribution. However as truncated distributions are technically conditional, should they directly inherit from ConditionalDistribution or does this not matter as everything is overloaded as required when wrapping?
distr uses the following hierarchy:
Distribution --> Univariate/Multivariate --> Discrete/Continuous
Python uses the following:
Distribution --> Discrete/Continuous --> Univariate/Multivariate
From reading through the distr vignettes I am assuming, but please correct me if wrong, that the former was chosen as you systematically implemented univariate distributions before moving to multivariate. However I believe that there are more interface design differences between continuous/discrete distributions than univariate/multivariate, so should we instead be considering the latter option?
What is the best way to approximate the expectation and variance for distributions (continuous or otherwise) over a possibly infinite support?
How best to implement this information and how much is required vs. useful? Which will inform abstraction amount.
We use sets at least three times in the current designs: in 'Type', 'distrDomain' and 'support'. I've been playing around with the 'sets' library, which allows for common sets, e.g. reals() and integers(), as well as basic set operations (intersection, union) and (in)equalities such as subsetting. The print method isn't amazing but it's informative, and that's what counts!
For example
reals()
> [-Inf, Inf]
integers()
> -Inf...Inf
naturals(r = 10)
> 1..10
If anyone can think of a good reason to define our own object let me know, to me this looks good.
An issue regarding convention, how should we proceed with getters and setters?
Either the strict OOP route of every variable and constant being private and only accessible via get(), or we could allow some variables (e.g. dictionaries, names) to be public and accessed via distr$variable. I personally think the former may be preferred, as it is very easy for a user to accidentally overwrite a public variable but much harder to do so accidentally to a private one.
Additionally, for elements in dictionaries, for example traits, should these be accessible via individual getters or via method chaining. e.g. to get 'type':
# Option 1 via individual getter
distr$getType()
# Option 2 via chaining
distr$getTraits()["Type"]
distr$getTraits()$get("Type") # Equivalently
distr$getTraits() %>% get("Type") # Equivalently
The first is more work and less flexible, the second requires slightly more user effort. In terms of extensibility, I think the second may be preferred.
I will opt for strict OOP getters/setters and method chaining and close this issue after a week if no-one has a strong opinion against.
This is low priority and can be moved into V2. All CoreStatistics decorator functionality is now in place and passing tests in testthat, i.e. functions, including expectation and the generalised kth moment, generated from discrete and continuous distributions are equal to the expected analytic results.
The question is whether we should allow these to be further customised by the user to supply their own functions, or if this is sufficient as individual functions can be overloaded when required.
Due to the CRAN notes problem, properties are defined as follows:
.properties is a private list holding the property details; properties() is a public getter.
But traits are defined by a single traits list, which is a locked environment (therefore un-editable). For consistency, should we change traits so that it is defined like properties above?
Following conversation with Peter, a good idea is to have a plot() method that can return a "plottable structure" as well as/instead of a plot to a graphical device. By a plottable structure we mean some sort of list that has details to be passed to plot/ggplot or that can be used for analysis. Additionally the generalised qqplot given in distrMod is definitely a good idea to carry forward. The 'issue' is to ascertain what details should go into this plotting structure that can be passed to either plot or ggplot and what the default plot should be for a simple call to e.g. plot(Norm()).
As most methods are not abbreviations but full names, it might be more in line to rename as follows: pdf -> density and cdf -> cumulative. This has the added benefit that defining pdf as an S3 generic doesn't mask the grDevices::pdf function.
There is a very strong argument to add S3 constructors for every implemented distribution for the less technical user.
e.g. Binomial$new(size = 10, prob = 0.5) -> binom(size = 10, prob = 0.5)
However, to do so we need to be careful about the names of the constructors. I opt for the short name, i.e. Binomial = binom, Normal = norm; however, this creates problems with things like Exponential = exp, which would mask base::exp.
By example: the convolution of indep. Binom distributions with the same probability can be expressed by
sum_i Binomial(n_i, p) = Binomial(sum_i n_i, p)
This result holds only when each Binomial distribution has the same probability. Therefore, if a user calls Convolution$new(Binom1, Binom2) for two Binomials with the same prob, should the resulting wrapper have the parameters Binom1_size, Binom2_size, prob or Binom1_size, Binom2_size, Binom1_prob, Binom2_prob? In the case of the former, the above result is exploited but the user loses the flexibility of setting both internal probabilities separately. In the latter case, we could either have a validation check that uses the above result when Binom1_prob == Binom2_prob, or the analytic result of the convolution of two general Binomials (not necessarily equal prob), which reduces to the above when the probs are identical.
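The identity above is easy to verify numerically in base R (a check, not proposed package code):

```r
# Convolve Binomial(n1, p) and Binomial(n2, p) by direct summation and
# compare against the analytic shortcut Binomial(n1 + n2, p).
n1 <- 4; n2 <- 6; p <- 0.3
conv <- sapply(0:(n1 + n2), function(x)
  sum(dbinom(0:x, n1, p) * dbinom(x - (0:x), n2, p)))
all.equal(conv, dbinom(0:(n1 + n2), n1 + n2, p))   # TRUE
```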
@fkiraly Could deriving the convolutions of each sensible distribution combination be a good exercise for the interns as part of implementing the distributions?
The question of traceability was raised by @PeterRuckdeschel as he noticed that the distr code can be quite complex for a non-technical user to understand, especially in relation to how an automatically generated function was created. The suggestion was to have a separate variable that pulls out the relevant information for a non-technical user should they want to understand the transformation or generation.
My personal opinion (and correct me if I'm wrong but I believe @PeterRuckdeschel noted the same) is that a technical user can probably navigate the complex code and a non-technical user may not actually care too much about this traceability in the first place, thus making this extra feature more cost than benefit.
But if anyone has any thoughts to add, please share them here.
Currently we have that imputation of d/p/q/r should be wrappers; for example, for a distribution X with a pdf and no cdf, a user can impute the cdf via:
ImputeCdf$new(X, strategy = "pdf2cdf")
However, the problem with this is that if a user imputes p/q/r all from d, then they end up with a distribution called imputedR_imputedQ_imputedP_X with parameters imputedR_imputedQ_imputedP_param1.
I think it is more sensible to have these imputations as functions that take as input the distribution and alter the respective function. e.g. imputeCDF(X) would return X but with the cdf imputed via a given strategy. Then we could include dynamic comments in the function and add to the distribution description a note detailing the imputation.
Thoughts?
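A sketch of the function-based alternative (imputeCdf here is an illustrative stand-alone function, not the proposed API; the strategy is numeric integration of the pdf):

```r
# Hypothetical "pdf2cdf" strategy: numerically integrate the pdf from
# the lower bound of the support up to each quantile.
imputeCdf <- function(pdf, lower = -Inf) {
  function(q) sapply(q, function(qi)
    integrate(pdf, lower = lower, upper = qi)$value)
}

# Impute the Exp(1) cdf from its pdf and compare with the analytic pexp()
cdf <- imputeCdf(function(x) dexp(x, rate = 1), lower = 0)
cdf(c(0.5, 1, 2))   # approximately pexp(c(0.5, 1, 2))
```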
For uniformity across all distribution classes and to enable abstraction where possible, we aim to reduce the number of classes that are taken as parameters/slots. To this end I can think of three options:
By example take BinomFamily()
with the following structure (name on LHS, class on RHS):
Say we want to instead have L2Deriv as a list or dictionary. L2Deriv currently has the following slots/sub-slots:
Now instead as a dictionary:
Or as list:
By using the latter two we avoid having to define a getter/setter for each of the slots/sub-slots. A dictionary is a slightly more flexible R6 approach but a list is more familiar to the average R user.
Note: The above is only an example and I'm not suggesting that L2Deriv necessarily should be reduced in this way, only that whatever approach we take should be consistent across the whole interface.
For a general probability distribution, what is the most efficient way to find the mode? For continuous univariate distributions, the function optimize should give a good approximation, but it doesn't work for discrete distributions. Perhaps @PeterRuckdeschel can advise based on how distr does this?
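A sketch of the two cases (function names are illustrative only): optimize() for the continuous case, and a scan over a (working) support for the discrete case:

```r
# Continuous: maximise the pdf over a finite interval.
modeContinuous <- function(pdf, lower, upper)
  optimize(pdf, c(lower, upper), maximum = TRUE)$maximum

# Discrete: evaluate the pdf over the support and take the arg-max.
modeDiscrete <- function(pdf, support)
  support[which.max(pdf(support))]

modeContinuous(function(x) dnorm(x, mean = 2), -10, 10)   # approx. 2
modeDiscrete(function(x) dbinom(x, 10, 0.5), 0:10)        # 5
```

The discrete case still leaves open how to choose the support to scan when it is infinite, which ties into the workingSupport discussion elsewhere in these issues.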
I've seen arguments online for and against this practice.
Arguments against:
Arguments for:
My opinion is to opt for a compromise of the two. Use docstrings/roxygen for public variables/methods which will then show up in the R help pages and documentation. Use comments for private variables/methods which will only show when accessing the functions.
All your opinions appreciated.
Use of R62S3 means that all S3 methods must be documented, which may help answer #40. Documentation may therefore need a rethink regarding best practices.
One thing that has come up multiple times in these issues and discussions is deciding when to perform validity checks and when/what errors to throw.
My suggestion is two-fold:
1. Use the checkmate package, which has three main types of checks, by example:
testNumeric(x) # TRUE if x is numeric, FALSE otherwise
checkNumeric(x) # TRUE if x is numeric, the string "Must be of type 'numeric', not '...'" otherwise
assertNumeric(x) # invisible(x) if x is numeric, Error: Assertion on 'x' failed: Must be of type 'numeric', not '...' otherwise
And we make similar checks as required, e.g. test/check/assertDomain.
2. Add a construction argument debug with possible values test/check/assert/ignore, which call the respective check or skip validation altogether.
This puts debugging fully in control of the user and is relatively efficient. By default I think check is the safest, in that it won't crash the code but will return a warning message. Packages such as mlr use assert as default.
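A sketch of what a hand-rolled test/check/assertDomain triple might look like, mirroring the checkmate convention (the names and message format are illustrative, not checkmate's own):

```r
# test*: logical; check*: TRUE or an error string; assert*: invisible(x) or error.
testDomain <- function(x, lower, upper) all(x >= lower & x <= upper)

checkDomain <- function(x, lower, upper) {
  if (testDomain(x, lower, upper)) TRUE
  else sprintf("Must lie in [%s, %s]", lower, upper)
}

assertDomain <- function(x, lower, upper) {
  if (!isTRUE(checkDomain(x, lower, upper)))
    stop("Assertion on 'x' failed: ", checkDomain(x, lower, upper))
  invisible(x)
}

testDomain(0.5, 0, 1)   # TRUE
checkDomain(2, 0, 1)    # "Must lie in [0, 1]"
```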
Currently on build there is a note about unsafe calls to unlockBinding, this is currently required in construction of distributions and on change of support. There are clear ways of fixing this, e.g. by introducing more getters/setters and moving variables from private to public. If we can give a clear argument as to why we rely on unlockBinding however then it will pass CRAN submission. We can return to this before submission or see how it goes at the time.
What is the best way to implement an Estimate style of method? Some possible suggestions:
Currently the distr design has the Distribution classes such that their main purpose is the p/d/q/r functions, as well as being available for statistical functions (e.g. mean, var, etc.). The ProbFamily classes contain more detailed information about the distribution itself, including all details about parameterisation. As pointed out by Peter, this abstraction is sensible as, once the parameters are given, the distribution doesn't require any other knowledge to make use of p/d/q/r, and other information could overwhelm the user. The question is then how best to perform this abstraction and to what level?
As identified in distr, there are many internal options that can be set by the user that could seem overwhelming or get in the way of clean code, hence these are made global options. The advantage of global options is cleaner code, fewer parameters for the non-technical user to worry about, and once set they can be ignored. The disadvantage is that they may be forgotten before usage or may put off a lazy user.
I think it is worth taking the time to go through all these options and classifying them as follows:
To demonstrate these by example:
- eps is set to 1e-15. This is hard-coded into the function and is therefore determined to be Option 3, whereas it could have been considered Option 1 or 2.
- integrate includes a rel.tol parameter to adjust relative accuracy. This defaults based on system settings. This is classified as Option 1 (and the default is based on Option 2).

Due to the complicated and precise documentation structures required in Roxygen, automatically generating and assigning S3 generics and methods is proving virtually impossible. However, there is no good reason to implement the dispatch system anyway, as every single method is the same internally and therefore class-independent. On release of R62S3 v2, implement R62Fun in the zzz.r file.
Exporting assertions such as test/check/assert/isLeptokurtic means that anyone can use these functions externally, otherwise they only exist as internal helper functions. It currently takes my computer around half a minute to load the package, so there is definitely an argument to not export them. However a modelling package using distr6 may benefit from these assertions.
Finally, the question of how to implement wrappers. In the following tests I have tried three methods:
The third method is similar to suggested for decorators/inheritance where the methods and variables of the old class are copied individually to the wrapped class.
The complete list of suggested classes and methods for distr6 can be found here. Of course this is not finalised and things will change in implementation but this will mark the end of the initial designs. This issue is for discussion regarding these, for example is anything obvious missing or should any functions be moved between decorators, etc.
If there are no opinions on this/everyone is satisfied, I'll close this down in a week.
The roxygen mantra is "do repeat yourself" in documentation. For inheritance this can get quite tedious. Whilst roxygen supports inherited methods, if we inherit every method down every tree this could get very messy, and with method overloading it could also be quite complicated. However, by neither inheriting (via docstrings) nor adding each inherited method manually, the user is left to follow a chain of links to find all methods.
Which of the two do we opt for:
All the checks around type, distrDomain, support, valueSupport and variateForm could be better streamlined. All can be used to define each other and defaults aren't clear. This is a low priority for now and can be looked at once stable.
Decide if/how to carry forward the following:
I have added a private variable, with private getters/setters, called workingSupport, and I'm not entirely sure how useful it is yet or how it should be combined with support. The purpose of workingSupport is two-fold: 1. to increase speed/efficiency for numeric calculations; 2. to reduce the size of the support of a distribution, especially when non-finite. The function .setWorkingSupport() works as follows:
if (distr$sup() == Inf) {
  sample <- distr$rand(10000)
  newsup <- max(sample)
  while (distr$pdf(newsup) > .Machine$double.eps)
    newsup <- newsup + 1
  return(newsup)
}
Analogously for distr$inf().
Currently this is quite inefficient, particularly in sampling but before I worry about any improvements I have the following questions:
We agreed not to automatically generate the cdf if not supplied. In this case, how should a truncation/huberization wrapper be implemented? Currently, for a distribution X with pdf f(x) and cdf F(x), the pdf of trunc(X) at limits (a, b) is implemented via pdf(trunc(X)) = f(x)/(F(b)-F(a)). Huberization relies on the cdf for the values at the borders.
So when cdf is not supplied should we:
a) throw an error that breaks the code
b) calculate cdf via numerical integration (when required) and throw a warning/message
c) use a different method for truncation/huberization (though none is currently known to me)
I personally lean towards b) with a message added to the description in the wrapper along the lines of "trunc(X) calculated via numerical integration of f(x), therefore results are approximate".
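A sketch of option b) (truncPdf is an illustrative name, not the wrapper API): compute F(a) and F(b) by numerical integration of f when no cdf is supplied.

```r
# Truncate a pdf to (a, b), integrating numerically for F(a) and F(b).
truncPdf <- function(pdf, a, b) {
  Fa <- integrate(pdf, lower = -Inf, upper = a)$value
  Fb <- integrate(pdf, lower = -Inf, upper = b)$value
  function(x) ifelse(x >= a & x <= b, pdf(x) / (Fb - Fa), 0)
}

# Truncated standard Normal on (-1, 1): the result integrates to 1.
tp <- truncPdf(dnorm, a = -1, b = 1)
integrate(tp, -1, 1)$value   # approx. 1
```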