xoopr / distr6
R6 object-oriented interface for probability distributions.
Home Page: https://xoopr.github.io/distr6/
License: Other
Two options:
Option 2 seems better.
One of the most powerful features of distr is the ability to automatically generate the d/p/q/r functions if not all are supplied to a constructor. The question to consider now is whether the current generating functions for these can be brought forward to an R6 upgrade, or whether there are other methods for doing so.
As part of the design package I'm doing a feasibility sketch, i.e. checking whether our design choices are possible in R6, specifically the Decorator design pattern and multiple inheritance.
So far I have the following workaround for Decorators. In this way there are no concrete decorator classes that can be constructed; instead, functions in the constructor for the Distribution object add methods from decorators when requested. This has the following consequences:
If anyone has any comments on this let me know, if not I'll close this issue after a week and proceed with this style of implementation for decorators.
Let X be a distribution with pdf f_x and a support on the interval [a, b]. By definition, f_x(c) = 0 for c < a or c > b. But this can be implemented in one of two ways:
The difference comes down to deciding if the user should implement these checks in a custom distribution (i.e. via Distribution$new(...)) or if the API design should be such that a call to liesInSupport() is performed before calling the user-defined pdf/cdf.
I am leaning toward (1) as it means there is less room for user error, and it also simplifies the formulas required for wrappers such as truncation.
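As a minimal sketch of option (1), assuming hypothetical names (makeCheckedPdf is not part of the proposed API), the framework could wrap a user-supplied pdf so that values outside the support automatically return 0:

```r
# Hypothetical helper: wraps a user-supplied pdf with an automatic
# support check, so f_x(c) = 0 outside [lower, upper] without the user
# having to implement the check themselves.
makeCheckedPdf <- function(userPdf, lower, upper) {
  function(x) {
    inSupport <- x >= lower & x <= upper   # vectorised liesInSupport-style check
    out <- numeric(length(x))
    out[inSupport] <- userPdf(x[inSupport])
    out
  }
}

# A user defines the Uniform(0, 1) pdf without any boundary checks...
rawPdf <- function(x) rep(1, length(x))
# ...and the framework returns the checked version.
pdf <- makeCheckedPdf(rawPdf, 0, 1)
pdf(c(-0.5, 0.5, 1.5))   # 0 1 0
```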
In terms of the user experience I can see three options for convolution, and more generally any functions that generate a new distribution from two or more constructed ones:
By example,
Z = Exp$new(rate = 1) + Exp$new(rate = 1)
Then Z$pdf gives a function involving rate, which gives three options:
The first option is the least flexible but requires the least amount of work. Arguments for this option are a) that the user can re-call the convolution after updating the parameters of the distributions and b) strictly speaking convolution of distributions is dependent on parameters and therefore they should remain fixed. Arguments against this (apart from flexibility) are that the user may want to keep working with the original distributions but only change their parameters after the convolution.
The second and third options are both considerably more work but add greater flexibility for updating parameters at any time. There is a possible midpoint between the first and third options, in which there is a wrapper with access to the original distributions but the parameters cannot be edited after convolution.
Any thoughts anyone has on this are appreciated.
Just to add this as a discussion point:
for various reasons, one may like to have some more exotic functionals related to distributions, such as n-th (anti-)derivatives of pdf, lp-norm of the various distribution characterizing functions, or mgf/cgf.
It's reasonably easy to plan them in from the start if we wanted to, but given any design, one may ask the question:
Due to the way R6 classes are stored, functionality is not as it should be. For example, the following isn't currently possible: Normal$new() %>% decorate(CoreStatistics). Whilst N <- Normal$new(); N %>% decorate(CoreStatistics) does work, the message produced is incorrect: ". is now decorated with CoreStatistics".
Our last discussion concluded that conditional distributions should be implemented as wrappers. This seemed like the natural choice as wrappers inherit methods from DistributionWrapper
that get all wrapped models and overload any methods/variables as required. However we also concluded that joint distributions inherit from Distribution and are not wrappers. This seems counterintuitive as the two are naturally related.
I would argue that a better option may be to have constructors for JointDistribution
and ConditionalDistribution
with methods to get one from the other where appropriate/possible and exploiting properties of independence etc. where possible.
Thoughts?
After a day of going back and forth over the best way to juggle a clean interface, preventing problems with numerical calculations when analytic results are available, and dealing with imputation of p/d/q/r, I think I have come up with a good compromise.
integrate, .integrate and Convolution will be analytic by default but numeric with warning otherwise, i.e. these won't be two separate functions but will just use whatever is available. So if a user calls Convolution$new(Bernoulli, Bernoulli) then this will default to a Binomial, but if it is called on custom distributions, the result is computed numerically (with warning). Same goes for other wrappers.
The result of this is that not all Distributions will have the same methods available as part of the core functionality (not to be confused with the CoreStatistics functionality). For example, say that there is no analytic result for the kurtosis of an Exponential distribution (there is, it's 9); then the Binomial distribution core would have a kurtosis method whereas Exponential would not, but this would be available as a numeric result after calling CoreStatistics$new.
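A rough sketch of the numeric fallback for a discrete convolution (plain functions with hypothetical names, not the wrapper API itself): when no analytic result is registered, the pdf of X + Y can be computed by summing over the support.

```r
# Hypothetical numeric fallback: pdf of X + Y for discrete X, Y
# sharing a common support, by direct summation.
convolvePdf <- function(pdf1, pdf2, support) {
  function(z) sapply(z, function(zi) sum(pdf1(support) * pdf2(zi - support)))
}

# Bernoulli(0.5) + Bernoulli(0.5) recovers the analytic Binomial(2, 0.5)
dbern <- function(x) ifelse(x %in% c(0, 1), 0.5, 0)
dsum <- convolvePdf(dbern, dbern, support = 0:1)
dsum(0:2)            # 0.25 0.50 0.25
dbinom(0:2, 2, 0.5)  # matches the analytic shortcut
```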
Finally, in constructing a decorator, we include the parameter show.warnings which, if TRUE, means that a message (not a warning, as these stack and become incredibly frustrating) is produced saying that the result is approximate only, or, if FALSE, doesn't produce any warnings.
@fkiraly I think this may be in line with what you were originally thinking. If it is, sorry it took me so long to reach it!
Again, thoughts and comments appreciated.
The current suggestion for the print method is to have the familiar distribution name followed by parameters, e.g. Binomial(10, 0.2). However for summary() it may be sensible to have various arguments that determine how much information is provided; for example, the summary method could return all reasonable statistical functions of interest, e.g. mean, var, kurtosis. Alternatively, in the simplest case it could just provide the distribution along with named parameters and values.
Should these be implemented as:
i) Inherited classes
ii) Composite classes
iii) Wrappers (i.e. composite classes without being formally defined as a new class, identified via some sort of flag, e.g. Truncated=TRUE)
There are a number of decisions to make regarding how to name methods, I'm talking specifically about the actual method names and not the case they're written in. These decisions include:
i) whether S3 and R6 methods should be equivalent, e.g. D$kurtosis() == kurtosis(D)
ii) which name to choose, e.g. D$p(1) or D$pdf(1) or D$density(1); e.g.2 D$mean() or D$expectation() or D$E()
iii) whether multiple aliases should co-exist, e.g. D$p(1) and D$pdf(1) and D$density(1)
My personal opinion is that S3 and R6 methods should have the same name, but at times it may be worth introducing extra S3 methods to be in line with common functions. For example it may be worth having all of D$expectation(), D$mean(), expectation(D), mean(D); or, as pointed out by @PeterRuckdeschel, as expectation is more informative, perhaps only D$expectation(), expectation(D), mean(D), where the final mean is kept in line with common practices in R. Additionally I think it may be useful to have both pdf(D) and pmf(D), or density(D) and mass(D). Extra pointer functions are cheap for S3 methods, so I think by having only one representation for each in R6 we can then use S3 to define multiple names (although I understand from an abstraction perspective this may be a bad idea).
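To illustrate how cheap these S3 "pointer" methods are, a sketch (using a plain classed list as a stand-in for an R6 object):

```r
# A stand-in for an R6 distribution object: one canonical method...
D <- structure(list(expectation = function() 0.5), class = "Distribution")

# ...and several S3 names that all delegate to it.
expectation <- function(x, ...) UseMethod("expectation")
expectation.Distribution <- function(x, ...) x$expectation()
mean.Distribution <- function(x, ...) x$expectation()   # base::mean is already generic

c(expectation(D), mean(D))   # 0.5 0.5
```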
Have we decided how, or whether, to specify the domain in which the random variable can take values?
This may make sense in multiple ways:
(i) the set where the r.v. with the given parameters may take values
(ii) the set where a r.v. within the family may take values
(iii) the "smallest standard superset" of where it may take values, e.g., R, R-plus, N, Z, etc
(iv) a combination of the above
I think (iii) makes type checks easy in cases where you combine random variables arithmetically, e.g., if you want to ask if they are both non-negative real etc you don't care about the precise support.
Whereas (i) is useful if you want to infer bounds or create composites.
An obvious sub-question is if there should be multiple domain specifications - e.g., one of the precise support, one of a more generic "scientific type".
R62S3 is used in the zzz.r file. We could call the function either on loading of the package or on attaching. In the former case, functions are loaded into local memory when the package is installed and loaded. In the latter, functions are loaded into local memory on use of library(distr6). I think the latter makes more sense UNLESS there is some reason why someone may want to use the S3 dispatch via distr6::somefun(...), but I think that this already adds more cumbersome interfacing than would be desired for the simple S3 dispatch methods...
All equations in \eqn and \deqn should be in standard LaTeX format with correct escape sequences, Greek letters, etc. Need to check the pdf vignettes to verify these.
Given the problems with multiple inheritance and the relatively small number of methods that are actually defined in any one of the abstract classes for continuous/discrete/univariate etc., is there actually any reason to have these defined as separate classes?
I think we should consider simplifying the structure and opting for a behavioural pattern instead that checks the properties/traits and calls the relevant method (as opposed to identifying the class hierarchy and dispatching on this).
This reduces the number of classes and problems with multiple inheritance and will later help in extensibility/scalability.
Does anyone have any opinions against this?
Decide:
i) How best to separate 'properties' and 'traits', if indeed this separation is sensible
ii) What should be included in both
iii) If these should be defined as R6 classes or named lists
In short, multiple inheritance does not work in R6. I have also seen a lot of arguments against multiple inheritance in general and therefore wonder if we can find a different design pattern that may work. I suggest using a form of the Adapter pattern as follows:
Distribution <- the abstract parent, which then splits into ValueSupport and VariateForm
XvTDistribution <- Adapter that has one field: list(ValueSupport, VariateForm)
Pseudo-code:
XvTDistribution <- R6Class(public = list(Type = list(ValueSupport, VariateForm), density = function() {...}))
BinomialDistribution <- R6Class(inherit = XvTDistribution)
BinomialDistribution$new() %>% density()
# This calls the density function in the parent adapter 'XvTDistribution', which identifies
# the caller as Discrete and Univariate and adapts the call to the relevant density functions.
Truncated distributions are currently a wrapper, inheriting from the generic wrapper class, which inherits from Distribution. However as truncated distributions are technically conditional, should they directly inherit from ConditionalDistribution or does this not matter as everything is overloaded as required when wrapping?
distr uses the following hierarchy:
Distribution --> Univariate/Multivariate --> Discrete/Continuous
Python uses the following:
Distribution --> Discrete/Continuous --> Univariate/Multivariate
From reading through the distr vignettes I am assuming, but please correct me if wrong, that the former was chosen as you systematically implemented univariate distributions before moving to multivariate. However I believe that there are more interface design differences between continuous/discrete distributions than univariate/multivariate, so should we instead be considering the latter option?
What is the best way to approximate the expectation and variance for distributions (continuous or otherwise) over a possibly infinite support?
How best to implement this information and how much is required vs. useful? Which will inform abstraction amount.
We use sets at least three times in the current designs: in 'Type', 'distrDomain' and 'support'. I've been playing around with the 'sets' library, which allows for common sets, e.g. reals() and integers(), as well as basic set operations (intersection, union) and (in)equalities such as subsetting. The print method isn't amazing but it's informative, and that's what counts!
For example
reals()
> [-Inf, Inf]
integers()
> -Inf...Inf
naturals(r = 10)
> 1..10
If anyone can think of a good reason to define our own object let me know, to me this looks good.
An issue regarding convention, how should we proceed with getters and setters?
Either the strict OOP route of every variable and constant being private and only accessible via get(), or we could allow some variables (e.g. dictionaries, names) to be public and accessed via distr$variable. I personally think the former may be preferred, as it is very easy for a user to accidentally overwrite a public variable but much harder to do so accidentally to a private one.
Additionally, for elements in dictionaries, for example traits, should these be accessible via individual getters or via method chaining. e.g. to get 'type':
# Option 1 via individual getter
distr$getType()
# Option 2 via chaining
distr$getTraits()["Type"]
distr$getTraits()$get("Type") # Equivalently
distr$getTraits() %>% get("Type") # Equivalently
The first is more work and less flexible, the second requires slightly more user effort. In terms of extensibility, I think the second may be preferred.
I will opt for strict OOP getters/setters and method chaining and close this issue after a week if no-one has a strong opinion against.
This is low priority and can be moved into V2. All CoreStatistics decorator functionality is now in place and passing tests in testthat, i.e. functions, including expectation and the generalised kth moment, generated from discrete and continuous distributions are equal to the expected analytic results.
The question is whether we should allow these to be further customised by the user to supply their own functions, or if this is sufficient as individual functions can be overloaded when required.
Due to the CRAN notes problem, properties are defined as follows:
.properties is a private list holding the property details; properties() is a public getter.
But traits are defined by a single traits list, which is a locked environment (therefore un-editable). For consistency, should we change traits so that it is defined like properties above?
Following conversation with Peter, a good idea is to have a plot() method that can return a "plottable structure" as well as/instead of a plot to a graphical device. By a plottable structure we mean some sort of list that has details to be passed to plot/ggplot or that can be used for analysis. Additionally the generalised qqplot given in distrMod is definitely a good idea to carry forward. The 'issue' is to ascertain what details should go into this plotting structure that can be passed to either plot or ggplot and what the default plot should be for a simple call to e.g. plot(Norm()).
As most methods are not abbreviations but full names, it might be more in line to rename as follows: pdf -> density and cdf -> cumulative. This has the added benefit that defining pdf as an S3 generic doesn't mask the grDevices::pdf function.
There is a very strong argument to add S3 constructors for every implemented distribution for the less technical user.
e.g. Binomial$new(size = 10, prob = 0.5) -> binom(size = 10, prob = 0.5)
However, to do so we need to be careful about the names of the constructors. I opt for the short name, i.e. Binomial = binom, Normal = norm; however, this creates problems with things like Exponential = exp, which would mask base::exp.
By example: the convolution of indep. Binom distributions with the same probability can be expressed by
sum_i Binomial(n_i, p) = Binomial(sum_i n_i, p)
This result holds only when each Binomial distribution has the same probability. Therefore, if a user calls Convolution$new(Binom1, Binom2) for two Binomials with the same prob, should the resulting wrapper have the parameters Binom1_size, Binom2_size, prob or Binom1_size, Binom2_size, Binom1_prob, Binom2_prob? In the case of the former, the above result is exploited but the user loses the flexibility of setting both internal probabilities separately. In the latter case, we could either have a validation check that uses the above result when Binom1_prob == Binom2_prob, or the analytic result of the convolution of two general Binomials (not necessarily equal prob), which reduces to the above when the probs are identical.
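The identity above is easy to verify numerically in base R (a check, not proposed package code):

```r
# Convolve Binomial(n1, p) and Binomial(n2, p) by direct summation and
# compare against the analytic shortcut Binomial(n1 + n2, p).
n1 <- 4; n2 <- 6; p <- 0.3
conv <- sapply(0:(n1 + n2), function(x)
  sum(dbinom(0:x, n1, p) * dbinom(x - (0:x), n2, p)))
all.equal(conv, dbinom(0:(n1 + n2), n1 + n2, p))   # TRUE
```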
@fkiraly Could deriving the convolutions of each sensible distribution combination be a good exercise for the interns as part of implementing the distributions?
The question of traceability was raised by @PeterRuckdeschel as he noticed that the distr code can be quite complex for a non-technical user to understand, especially in relation to how an automatically generated function was created. The suggestion was to have a separate variable that pulls out the relevant information for a non-technical user should they want to understand the transformation or generation.
My personal opinion (and correct me if I'm wrong but I believe @PeterRuckdeschel noted the same) is that a technical user can probably navigate the complex code and a non-technical user may not actually care too much about this traceability in the first place, thus making this extra feature more cost than benefit.
But if anyone has any thoughts to add, please share them here.
Currently we have that imputation of d/p/q/r should be wrappers; for example, for a distribution X with a pdf and no cdf, a user can impute the cdf via:
ImputeCdf$new(X, strategy = "pdf2cdf")
However, the problem with this is that if a user imputes p/q/r all from d, then they end up with a distribution called imputedR_imputedQ_imputedP_X with parameters imputedR_imputedQ_imputedP_param1.
I think it is more sensible to have these imputations as functions that take as input the distribution and alter the respective function. e.g. imputeCDF(X) would return X but with the cdf imputed via a given strategy. Then we could include dynamic comments in the function and add to the distribution description a note detailing the imputation.
Thoughts?
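A sketch of the function-based alternative (imputeCdf here is an illustrative stand-alone function, not the proposed API; the strategy is numeric integration of the pdf):

```r
# Hypothetical "pdf2cdf" strategy: numerically integrate the pdf from
# the lower bound of the support up to each quantile.
imputeCdf <- function(pdf, lower = -Inf) {
  function(q) sapply(q, function(qi)
    integrate(pdf, lower = lower, upper = qi)$value)
}

# Impute the Exp(1) cdf from its pdf and compare with the analytic pexp()
cdf <- imputeCdf(function(x) dexp(x, rate = 1), lower = 0)
cdf(c(0.5, 1, 2))   # approximately pexp(c(0.5, 1, 2))
```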
For uniformity across all distribution classes and to enable abstraction where possible, we aim to reduce the number of classes that are taken as parameters/slots. To this end I can think of three options:
By example take BinomFamily()
with the following structure (name on LHS, class on RHS):
Say we want to instead have L2Deriv as a list or dictionary. L2Deriv currently has the following slots/sub-slots:
Now instead as a dictionary:
Or as list:
By using the latter two we avoid having to define a getter/setter for each of the slots/sub-slots. A dictionary is a slightly more flexible R6 approach but a list is more familiar to the average R user.
Note: The above is only an example and I'm not suggesting that L2Deriv necessarily should be reduced in this way, only that whatever approach we take should be consistent across the whole interface.
For a general probability distribution, what is the most efficient way to find the mode? For continuous univariate distributions, the function optimize should give a good approximation, but it doesn't work for discrete distributions. Perhaps @PeterRuckdeschel can advise based on how distr does this?
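A sketch of the two cases (function names are illustrative only): optimize() for the continuous case, and a scan over a (working) support for the discrete case:

```r
# Continuous: maximise the pdf over a finite interval.
modeContinuous <- function(pdf, lower, upper)
  optimize(pdf, c(lower, upper), maximum = TRUE)$maximum

# Discrete: evaluate the pdf over the support and take the arg-max.
modeDiscrete <- function(pdf, support)
  support[which.max(pdf(support))]

modeContinuous(function(x) dnorm(x, mean = 2), -10, 10)   # approx. 2
modeDiscrete(function(x) dbinom(x, 10, 0.5), 0:10)        # 5
```

The discrete case still leaves open how to choose the support to scan when it is infinite, which ties into the workingSupport discussion elsewhere in these issues.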
I've seen arguments online for and against this practice.
Arguments against:
Arguments for:
My opinion is to opt for a compromise of the two. Use docstrings/roxygen for public variables/methods which will then show up in the R help pages and documentation. Use comments for private variables/methods which will only show when accessing the functions.
All your opinions appreciated.
Use of R62S3 means that all S3 methods must be documented, which may help answer #40. Documentation may therefore need a rethink regarding best practices.
One thing that has come up multiple times in these issues and discussions is deciding when to perform validity checks and when/what errors to throw.
My suggestion is two-fold:
1. Use the checkmate package, which has three main types of checks, by example:
testNumeric(x) # TRUE if x is numeric, FALSE otherwise
checkNumeric(x) # TRUE if x is numeric, the string "Must be of type 'numeric', not '...'" otherwise
assertNumeric(x) # invisible(x) if x is numeric, Error: Assertion on 'x' failed: Must be of type 'numeric', not '...' otherwise
And we make similar checks as required, e.g. test/check/assertDomain.
2. Add a construction argument debug with possible values test/check/assert/ignore, which call the respective check or skip validation altogether.
This puts debugging fully in control of the user and is relatively efficient. By default I think check is the safest, in that it won't crash the code but will return a warning message. Packages such as mlr use assert as default.
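A sketch of what a hand-rolled test/check/assertDomain triple might look like, mirroring the checkmate convention (the names and message format are illustrative, not checkmate's own):

```r
# test*: logical; check*: TRUE or an error string; assert*: invisible(x) or error.
testDomain <- function(x, lower, upper) all(x >= lower & x <= upper)

checkDomain <- function(x, lower, upper) {
  if (testDomain(x, lower, upper)) TRUE
  else sprintf("Must lie in [%s, %s]", lower, upper)
}

assertDomain <- function(x, lower, upper) {
  if (!isTRUE(checkDomain(x, lower, upper)))
    stop("Assertion on 'x' failed: ", checkDomain(x, lower, upper))
  invisible(x)
}

testDomain(0.5, 0, 1)   # TRUE
checkDomain(2, 0, 1)    # "Must lie in [0, 1]"
```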
Currently on build there is a note about unsafe calls to unlockBinding, this is currently required in construction of distributions and on change of support. There are clear ways of fixing this, e.g. by introducing more getters/setters and moving variables from private to public. If we can give a clear argument as to why we rely on unlockBinding however then it will pass CRAN submission. We can return to this before submission or see how it goes at the time.
What is the best way to implement an Estimate style of method? Some possible suggestions:
Currently the distr design has the Distribution classes such that their main purpose is the p/d/q/r functions, as well as being available for statistical functions (e.g. mean, var, etc.). The ProbFamily classes contain more detailed information about the distribution itself, including all details about parameterisation. As pointed out by Peter, this abstraction is sensible as, once the parameters are given, the distribution doesn't require any other knowledge to make use of p/d/q/r, and other information could overwhelm the user. The question is then how best to perform this abstraction and to what level?
As identified in distr, there are many internal options that can be set by the user that could seem overwhelming or get in the way of clean code, hence these are made global options. The advantage of global options is cleaner code, fewer parameters for the non-technical user to worry about, and once set they can be ignored. The disadvantage is that they may be forgotten before usage or may put off a lazy user.
I think it is worth taking the time to go through all these options and classifying them as follows:
To demonstrate these by example:
- eps is set to 1e-15. This is hard-coded into the function and is therefore determined to be Option 3, whereas it could have been considered Option 1 or 2.
- integrate includes a rel.tol parameter to adjust relative accuracy. This defaults based on system settings. This is classified as Option 1 (and the default is based on Option 2).

Due to the complicated and precise documentation structures required in Roxygen, automatically generating and assigning S3 generics and methods is proving virtually impossible. However, there is no good reason to implement the dispatch system anyway, as every single method is the same internally and therefore class-independent. On release of R62S3 v2, implement R62Fun in the zzz.r file.
Exporting assertions such as test/check/assert/isLeptokurtic means that anyone can use these functions externally, otherwise they only exist as internal helper functions. It currently takes my computer around half a minute to load the package, so there is definitely an argument to not export them. However a modelling package using distr6 may benefit from these assertions.
Finally, the question of how to implement wrappers. In the following tests I have tried three methods:
The third method is similar to suggested for decorators/inheritance where the methods and variables of the old class are copied individually to the wrapped class.
The complete list of suggested classes and methods for distr6 can be found here. Of course this is not finalised and things will change in implementation but this will mark the end of the initial designs. This issue is for discussion regarding these, for example is anything obvious missing or should any functions be moved between decorators, etc.
If there are no opinions on this/everyone is satisfied, I'll close this down in a week.
The roxygen mantra is "do repeat yourself" in documentation. For inheritance this can get quite tedious. Whilst roxygen supports inherited methods, if we inherit every method down every tree this could get very messy, and with method overloading it could also be quite complicated. However, by neither inheriting (via docstrings) nor adding each inherited method manually, the user is left to follow a chain of links to find all methods.
Which of the two do we opt for:
All the checks around type, distrDomain, support, valueSupport and variateForm could be better streamlined. All can be used to define each other and defaults aren't clear. This is a low priority for now and can be looked at once stable.
Decide if/how to carry forward the following:
I have added a private variable, with private getters/setters, called workingSupport, and I'm not entirely sure how useful it is yet or how it should be combined with support. The purpose of workingSupport is two-fold: 1. to increase speed/efficiency for numeric calculations; 2. to reduce the size of the support of a distribution, especially when non-finite. The function .setWorkingSupport() works as follows:
if (distr$sup() == Inf) {
  sample <- distr$rand(10000)
  newsup <- max(sample)
  while (distr$pdf(newsup) > .Machine$double.eps)
    newsup <- newsup + 1
  return(newsup)
}
Analogously for distr$inf().
Currently this is quite inefficient, particularly in sampling but before I worry about any improvements I have the following questions:
We agreed not to automatically generate the cdf if not supplied. In this case, how should a truncation/huberization wrapper be implemented? Currently, for a distribution X with pdf f(x) and cdf F(x), the pdf of trunc(X) at limits (a, b) is implemented via pdf(trunc(X)) = f(x)/(F(b)-F(a)). Huberization relies on the cdf for the values at the borders.
So when cdf is not supplied should we:
a) throw an error that breaks the code
b) calculate cdf via numerical integration (when required) and throw a warning/message
c) use a different method for truncation/huberization (though none is currently known to me)
I personally lean towards b) with a message added to the description in the wrapper along the lines of "trunc(X) calculated via numerical integration of f(x), therefore results are approximate".
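A sketch of option b) (truncPdf is an illustrative name, not the wrapper API): compute F(a) and F(b) by numerical integration of f when no cdf is supplied.

```r
# Truncate a pdf to (a, b), integrating numerically for F(a) and F(b).
truncPdf <- function(pdf, a, b) {
  Fa <- integrate(pdf, lower = -Inf, upper = a)$value
  Fb <- integrate(pdf, lower = -Inf, upper = b)$value
  function(x) ifelse(x >= a & x <= b, pdf(x) / (Fb - Fa), 0)
}

# Truncated standard Normal on (-1, 1): the result integrates to 1.
tp <- truncPdf(dnorm, a = -1, b = 1)
integrate(tp, -1, 1)$value   # approx. 1
```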