jonasmoss / univariateml

An R package for maximum likelihood estimation of univariate densities.

Home Page: https://jonasmoss.github.io/univariateML/

License: Other

R 97.78% TeX 2.22%

estimation density maximum-likelihood

univariateml's Introduction

univariateML

Overview

univariateML is an R package for user-friendly maximum likelihood estimation of a selection of parametric univariate densities. In addition to basic estimation capabilities, this package supports visualization through plot and qqmlplot, model selection by AIC and BIC, confidence sets through the parametric bootstrap with bootstrapml, and convenience functions such as the density, distribution function, quantile function, and random sampling at the estimated distribution parameters.

Installation

Use the following command from inside R to install from CRAN.

install.packages("univariateML")

Or install the development version from GitHub.

# install.packages("devtools")
devtools::install_github("JonasMoss/univariateML")

Usage

The core of univariateML is the set of ml*** functions, where *** is a distribution suffix such as norm, gamma, or weibull.

library("univariateML")
mlweibull(egypt$age)
#> Maximum likelihood estimates for the Weibull model 
#>  shape   scale  
#>  1.404  33.564

Now we can visually assess the fit of the Weibull model to the egypt data with a plot.

hist(egypt$age, freq = FALSE, xlab = "Mortality", main = "Egypt")
lines(mlweibull(egypt$age))

Supported densities

Name	univariateML function	Package
Cauchy distribution	`mlcauchy`	stats
Gumbel distribution	`mlgumbel`	extraDistr
Laplace distribution	`mllaplace`	extraDistr
Logistic distribution	`mllogis`	stats
Normal distribution	`mlnorm`	stats
Student t distribution	`mlstd`	fGarch
Generalized Error distribution	`mlged`	fGarch
Skew Normal distribution	`mlsnorm`	fGarch
Skew Student t distribution	`mlsstd`	fGarch
Skew Generalized Error distribution	`mlsged`	fGarch
Beta prime distribution	`mlbetapr`	extraDistr
Exponential distribution	`mlexp`	stats
Gamma distribution	`mlgamma`	stats
Inverse gamma distribution	`mlinvgamma`	extraDistr
Inverse Gaussian distribution	`mlinvgauss`	actuar
Inverse Weibull distribution	`mlinvweibull`	actuar
Log-logistic distribution	`mlllogis`	actuar
Log-normal distribution	`mllnorm`	stats
Lomax distribution	`mllomax`	extraDistr
Rayleigh distribution	`mlrayleigh`	extraDistr
Weibull distribution	`mlweibull`	stats
Log-gamma distribution	`mllgamma`	actuar
Pareto distribution	`mlpareto`	extraDistr
Beta distribution	`mlbeta`	stats
Kumaraswamy distribution	`mlkumar`	extraDistr
Logit-normal	`mllogitnorm`	logitnorm
Uniform distribution	`mlunif`	stats
Power distribution	`mlpower`	extraDistr

Implementations

Analytic formulae for the maximum likelihood estimates are used whenever they exist. Most ml*** functions without analytic solutions have a custom-made Newton-Raphson solver. These can be much faster than a naïve solution using nlm or optim. For example, mlbeta is much faster than the naïve solution using nlm.

# install.packages("microbenchmark")
set.seed(313)
x <- rbeta(500, 2, 7)

microbenchmark::microbenchmark(
  univariateML = univariateML::mlbeta(x),
  naive = nlm(function(p) -sum(dbeta(x, p[1], p[2], log = TRUE)), p = c(1, 1)))
#> Unit: microseconds
#>          expr     min       lq      mean   median       uq     max neval
#>  univariateML   259.2   348.75   557.959   447.05   536.40  5103.5   100
#>         naive 15349.1 15978.35 16955.165 16365.45 17082.25 48941.4   100
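
To illustrate why a dedicated Newton-Raphson solver can be cheap, here is a minimal base-R sketch for the gamma model. It is not the package's implementation: the function name gamma_mle_newton, the starting value, and the stopping rule are assumptions made for this example. The shape a solves log(a) - digamma(a) = s, where s = log(mean(x)) - mean(log(x)), and the rate then follows in closed form.

```r
# Sketch only: Newton-Raphson for the gamma MLE (not univariateML's code).
gamma_mle_newton <- function(x, tol = 1e-10, max_iter = 100) {
  stopifnot(is.numeric(x), all(x > 0))
  s <- log(mean(x)) - mean(log(x))
  # A common closed-form approximation, used here as the starting value.
  a <- (3 - s + sqrt((s - 3)^2 + 24 * s)) / (12 * s)
  for (i in seq_len(max_iter)) {
    g <- log(a) - digamma(a) - s   # score equation in the shape parameter
    h <- 1 / a - trigamma(a)       # derivative of the score equation
    a_new <- a - g / h             # Newton-Raphson update
    converged <- abs(a_new - a) < tol * a
    a <- a_new
    if (converged) break
  }
  c(shape = a, rate = a / mean(x))
}

set.seed(313)
x <- rgamma(500, shape = 2, rate = 3)
fit <- gamma_mle_newton(x)
fit
```

Each iteration only needs the two sufficient statistics mean(x) and mean(log(x)), so the cost per step does not grow with the sample size, unlike a generic optimizer that re-evaluates the full log-likelihood.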

The maximum likelihood estimators in this package have all been subject to testing; see the tests folder for details.

Documentation

For an overview of the package and its features see the overview vignette. For an illustration of how this package can make an otherwise long and laborious process much simpler, see the copula vignette.

How to Contribute or Get Help

Please read CONTRIBUTING.md for details about how to contribute or get help.

univariateml's Issues

CRAN release

Would it be possible to publish a new CRAN version in the next 1-2 weeks? (I want to publish a new package which relies on univariateML). I can help prepare everything if necessary.

Document the minimal demands for tests, documentation, and programming for new densities.

Add this to the wiki.

Use pkgdown.

Read and follow https://pkgdown.r-lib.org/reference/build_home.html
Read and follow GitHub docs
Make sure everything looks presentable.

Document plot, lines and points.

Add examples to the ml functions.

Ready for CRAN submission and write JOSS paper.

Make documentation site deploy automatically.

I currently copy and paste the docs into netlify. It would be nice to do this automatically.

Make test for model_select.

Make tests for model_select that makes sure all functions work.

Unit tests: Add support to unit tests.

Add support attribute testing to the unit tests.

Should add tests for dml, pml and qml for all ml options.

Make a copula vignette.

Make a small copula vignette à la Vinnie Ko.

Consider Rfast!

Rfast is a package with many univariate densities implemented. Most if not all of the implementations have a higher speed than the implementations in this package and the overlap is large.

Do one of these:

Import Rfast and make use of their implementations;
Make univariateML GPL and adopt the code from Rfast.

Add "safety" pml, qml, rml.

Not every density has a CDF, quantile function, or random variate generator. Make safe surrogates in the pml, qml and rml functions.
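
One way such surrogates could look, as a rough base-R sketch: build the CDF by numerical integration of the density, the quantile function by root finding on that CDF, and the sampler by inverse-transform sampling. The helper names make_cdf, make_quantile and make_sampler are hypothetical and not part of univariateML; the normal density is used only so the result can be compared with pnorm and qnorm.

```r
# Sketch only: numerical surrogates built from a density function alone.
make_cdf <- function(dens, lower = -Inf) {
  function(q) {
    vapply(q, function(q_i) stats::integrate(dens, lower, q_i)$value, numeric(1))
  }
}

make_quantile <- function(cdf, interval) {
  function(p) {
    vapply(p, function(p_i) {
      # Solve cdf(q) = p_i on a user-supplied bracketing interval.
      stats::uniroot(function(q) cdf(q) - p_i, interval = interval, tol = 1e-9)$root
    }, numeric(1))
  }
}

make_sampler <- function(quantile_fun) {
  # Inverse-transform sampling: push uniform draws through the quantile function.
  function(n) quantile_fun(stats::runif(n))
}

dens <- function(x) stats::dnorm(x, mean = 1, sd = 2)
cdf <- make_cdf(dens)
qf <- make_quantile(cdf, interval = c(-20, 20))
rf <- make_sampler(qf)

cdf(2)    # compare with pnorm(2, 1, 2)
qf(0.9)   # compare with qnorm(0.9, 1, 2)
```

These surrogates are slow and only as accurate as integrate and uniroot allow, so they would be a fallback rather than a replacement for native p***, q*** and r*** functions.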

Better documentation for `pml` etc.

Better docs and some examples.

Better automated tests.

Currently, most of the automated tests for the ml*** functions are copy-pasta. This has two downsides: 1.) It is hard to verify if each test is complete. 2.) If we want to implement more tests or change the object structure, we would have to modify 22 tests.

One test file that avoids this problem is test_input_checks.R. This goes through every ml*** file with just a couple of lines.

Most of the tests are straight-forward such as checking whether the input checks work, if the objects have all the needed attributes, etc. I propose the following pattern:

fun = function(x) eval(call("attr", match.call()[[1]], "type"))
attr(fun, "type") = "continuous"

Then

fun()
#> [1] "continuous"

These function attributes can then be used to do both testing and populating attributes of the univariateML objects. Take mlcauchy and consider

attr(mlcauchy, "model") = "Cauchy"
attr(mlcauchy, "name") = "mlcauchy"
attr(mlcauchy, "density") = "stats::dcauchy"
attr(mlcauchy, "parameters") = c("location", "scale")
attr(mlcauchy, "test_call") = "stats::dcauchy(n, 0, 1)"
attr(mlcauchy, "support") = c(-Inf, Inf)

This setup allows us to test a lot of things easily (currently untested), such as the existence of the appropriate d*** and r*** functions and whether the parameter names are correct. (They should be the first parameters after x.)

Potential problems: The attributes can be removed from the function mlcauchy, which will break the function in most scenarios. An alternative is to start each function with a variable such as name = quote(mlcauchy) which identifies the function instead of using match.call().
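
A rough sketch of how such attributes could drive one generic check instead of many copied tests. The names fake_mlcauchy and check_ml_metadata are hypothetical; fake_mlcauchy is a stand-in for a real ml*** function so that the example runs with base R only.

```r
# Sketch only: generic metadata checks driven by function attributes.
fake_mlcauchy <- function(x) NULL
attr(fake_mlcauchy, "density") <- "stats::dcauchy"
attr(fake_mlcauchy, "parameters") <- c("location", "scale")
attr(fake_mlcauchy, "support") <- c(-Inf, Inf)

check_ml_metadata <- function(fun) {
  dens <- eval(parse(text = attr(fun, "density")))  # resolve "pkg::name"
  params <- attr(fun, "parameters")
  # The parameters should be the first arguments after x.
  arg_names <- names(formals(dens))[1 + seq_along(params)]
  support <- attr(fun, "support")
  c(
    density_exists = is.function(dens),
    parameters_match = identical(arg_names, params),
    support_valid = is.numeric(support) && length(support) == 2 &&
      support[1] < support[2]
  )
}

check_ml_metadata(fake_mlcauchy)
```

A test file could then loop this check over every ml*** function, in the same spirit as test_input_checks.R.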

Fix input value in ml**** functions.

Change
#' @param x The data from which the estimate is to be computed.
to something better, e.g. "Numeric vector." Precision is good.

Extended input checks

These should probably throw an error:

> mat_3cols <- replicate(3, rexp(100))
> mlexp(mat_3cols)
Maximum likelihood estimates for the Exponential model 
rate  
1.07

> mlexp(letters[1:5])
Maximum likelihood estimates for the Exponential model 
rate  
  NA  
Warning message:
In mean.default(x) : argument is not numeric or logical: returning NA
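
A possible shape for such a check, sketched in base R. The helper name check_ml_input and the error messages are assumptions, not part of the package; the sketch covers only the two cases shown above (non-numeric input and a matrix with several columns).

```r
# Sketch only: stricter input validation for ml*** functions.
check_ml_input <- function(x) {
  if (!is.numeric(x)) {
    stop("x must be numeric, not ", class(x)[1], ".")
  }
  # Allow plain vectors and arrays with at most one non-trivial dimension.
  if (!is.null(dim(x)) && sum(dim(x) > 1) > 1) {
    stop("x must be a vector or an array with a single non-trivial dimension.")
  }
  as.numeric(x)  # drop dimensions and attributes
}

mat_3cols <- replicate(3, rexp(100))
try(check_ml_input(mat_3cols))     # error: more than one non-trivial dimension
try(check_ml_input(letters[1:5]))  # error: not numeric
check_ml_input(matrix(1:3, ncol = 1))
```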

mlnaka() does not ensure parameter bounds

There seems to be a bug in mlnaka():

set.seed(1)
x <- rexp(50)
fit <- mlnaka(x)
fit
#> Maximum likelihood estimates for the Nakagami model 
#> shape   scale  
#> 0.4638  1.7477 

pml(x, fit)
#> Error: 'shape and scale are not valid parameter vectors

As far as I understand, the shape parameter must be >= 0.5, but that's not ensured by the fitting procedure.

Make an overview vignette.

Make an overview vignette. It should contain:

AIC and BIC comparisons
Plotting
Ordinary PP plots and QQ plots
Comparative QQ plots
Parametric bootstrapping.
Usage of density, cdfs, quantile and random variate generation functions.

Then remove parts of the large readme and redirect them to the vignette.

Make P-P plots and Q-Q plots.

Should make qqmlplot modelled after stats::qqplot. In addition a ppplot should be made.

qqmlplot: Basic qq plot functionality.
ppmlplot: Basic pp plot functionality.
Basic documentation modeled after stats::qqplot.
Basic examples for both.
Make unit tests for 100% coverage.
Write more documentation with references and an explanation about the difference between this simple qq plot and the more sophisticated qqplot used in stats::qqnorm. Must have references here.
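
As a rough illustration of the basic Q-Q functionality described above, the following base-R sketch computes the points of a Q-Q plot against a fitted model. The helper qq_points is hypothetical, and the exponential model is chosen only because its MLE is closed-form.

```r
# Sketch only: points for a Q-Q plot against a fitted parametric model.
qq_points <- function(x, quantile_fun) {
  n <- length(x)
  data.frame(
    theoretical = quantile_fun(stats::ppoints(n)),  # fitted model quantiles
    sample = sort(x)                                # empirical quantiles
  )
}

set.seed(1)
x <- rexp(100, rate = 2)
rate_hat <- 1 / mean(x)  # exponential MLE
pts <- qq_points(x, function(p) stats::qexp(p, rate = rate_hat))

# To draw the plot:
# plot(pts$theoretical, pts$sample); abline(0, 1)
```

A P-P version would follow the same pattern, plotting the fitted CDF evaluated at sort(x) against ppoints(n).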

Add more densities.

We should add more distributions to the package. Two points to keep in mind. First, the densities should have few parameters, so no normal mixtures with 10 parameters. Second, they must be implemented in another package with at least the density function d*** implemented.

extraDistr

The package extraDistr has these non-implemented distributions that might be interesting to our users.

actuar

The actuar package contains a lot of heavy-tailed distributions such as:

Loglogistic: Transform and use the current logistic ML.
Loggamma: Transform and use the current gamma ML.
InverseWeibull: Transform and use the current Weibull ML.
Burr: Might have the same kind of problems as Lomax.
InverseGaussian: Use this package instead of statmod to reduce the number of dependencies.
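
The "transform and reuse" idea for the log-type distributions above can be sketched with the log-normal, used here as an analogous case because its MLE is closed-form: fit the base model to the transformed data and relabel the parameters. The function names are hypothetical and this is not the package's code.

```r
# Sketch only: fit on a transformed scale and reuse an existing estimator.
ml_normal <- function(y) {
  mu <- mean(y)
  c(mean = mu, sd = sqrt(mean((y - mu)^2)))  # the MLE divides by n, not n - 1
}

ml_lognormal_by_transform <- function(x) {
  stopifnot(is.numeric(x), all(x > 0))
  est <- ml_normal(log(x))  # log(x) is normal when x is log-normal
  c(meanlog = est[["mean"]], sdlog = est[["sd"]])
}

set.seed(1)
x <- rlnorm(200, meanlog = 0.5, sdlog = 1.2)
fit <- ml_lognormal_by_transform(x)

# Log-likelihood on the original scale, for checking the estimate.
loglik <- function(p) sum(stats::dlnorm(x, p[1], p[2], log = TRUE))
loglik(fit)
```

This works because the Jacobian of the transformation does not depend on the parameters, so the maximizer is the same on both scales.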

Some other distributions

Here are some other distributions. Be sure to check the quality of the implementation before making an ML function. If the implementations are too bad, make a small package containing only the d***, p*** functions and submit it to CRAN.

Foldnorm: In VGAM
LogitNormal: logitnorm package. Very easy to implement.
GenGamma: In flexsurv package.
EMG: In EMG package

Add confint generic.

mllomax: When does the MLE exist? Does the current algorithm work?

The MLE doesn't always exist for the Lomax distribution; its maximum would be at lambda = 0 in these cases. When it doesn't exist, the sequence f(x; lambda, kappa) will converge to an exponential. This is handled by checking if the optimizer gets really close to 0 in lambda. The current algorithm is a simple Newton-Raphson, but the function is not convex. Will it always work?

Find and implement a simple check for the existence of the MLE.
Will the algorithm work when the MLE exists? If not, make one that works.

Use checkmate.

checkmate can probably be used for better and faster input checks.

Check all ML functions against the literature.

First task

Make details for each distribution.

Second Task

Check all ML functions against the literature.

Update documentation on Netlify.

Add, document, and test a parametric bootstrap with map-reduce functionality.

Shouldn't be too hard.

Make the documentation of `univariateML_models` list the available models.

Support for discrete distributions

Would be nice to support some distributions with discrete support (like Poisson or Binomial). Are there any reasons to not support them in this package? I might work on a PR at some point, but wanted to check with you first.

P.S.: Great, great package! I always wanted to do something like that, but never got around to it. Relieved that I can cross that off my list ;)

Default plotting range not working for uniform distributions

Function plot_wrangler (and hence the plot, lines and points methods) doesn't work for uniform distributions if range = NULL:

library(univariateML)
plot(mlunif(0:1))

produces the following error:

Error in abs(support[1]) : non-numeric argument to mathematical function

Docs: Parametric bootstrap.

Add reference to parametric bootstraps.
Note the importance of pivots.
Explain / document the rôle of the map and reducers, maybe with a reference.
More tests for reducers and mappers.
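
To make the roles concrete, here is a minimal base-R sketch of a parametric bootstrap with a map and a reducer. It does not use bootstrapml; the function param_bootstrap and its arguments are hypothetical. The map is applied to each bootstrap estimate, and the reducer summarises the mapped values.

```r
# Sketch only: parametric bootstrap with map and reducer steps.
param_bootstrap <- function(estimator, sampler, n, reps = 1000,
                            map = identity, reducer = stats::quantile, ...) {
  # Simulate from the fitted model, re-estimate, then map each estimate.
  boots <- replicate(reps, map(estimator(sampler(n))))
  # Reduce per parameter if the mapped estimates are vectors.
  if (is.matrix(boots)) apply(boots, 1, reducer, ...) else reducer(boots, ...)
}

set.seed(1)
x <- rexp(200, rate = 2)
rate_hat <- 1 / mean(x)  # exponential MLE

# 95% interval for the mean 1 / rate, obtained by mapping rate -> 1 / rate.
ci <- param_bootstrap(
  estimator = function(y) 1 / mean(y),
  sampler = function(n) rexp(n, rate = rate_hat),
  n = length(x),
  map = function(rate) 1 / rate,
  probs = c(0.025, 0.975)
)
ci
```

Separating the two steps means the same bootstrap samples can serve different targets: change the map to change the quantity of interest, and change the reducer to change the summary.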

Unit test: Tests for dml, pml, qml, and rml for all ml options.

Should add tests for dml, pml and qml for all ml options.

Attributes on the ml*** functions.

Sometimes we need to know, e.g., the support of an ml*** function before calling it. For instance, the package kdensity needs both the support and the name of the density. Currently this information can only be obtained by calling the function or by finding the attr(object, "support") part of the code.

One way to fix this is to let the functions themselves have attributes.

bug in model_select

Hello,
It seems that the fix resulted in a bug in model_select function.
For all available distributions:

set.seed(1)
x <- actuar::rllogis(500, shape = 3, rate = 2)
univariateML::model_select(x, criterion = "bic")
Error in log(nos) : non-numeric argument to mathematical function

For normal distribution:

univariateML::model_select(x, models = "norm", criterion = "bic")
Maximum likelihood estimates for the Normal model 
  mean      sd  
0.5834  0.3815

For log-logistic distribution:

univariateML::model_select(x, models = "llogis", criterion = "bic")
Error in log(nos) : non-numeric argument to mathematical function

I cannot easily see why it happens.

Fix github actions

Some of them need to be updated to version 2.

installation from GitHub

Hello,
when I tried to install the development version of the package, I got a warning for the actuar package:

> devtools::install_github("JonasMoss/univariateML")
Downloading GitHub repo JonasMoss/univariateML@HEAD
Skipping 1 packages not available: actuar

It seems to result from the actuar's R dependence (≥ 4.1.0) in its last version.
Is there a way to solve this issue besides updating the R version or installing the package via CRAN?

Weighted likelihood fitting

Could be interesting to allow users to supply weights (useful for imbalanced samples or varying coefficients models). We would need to check feasibility first, since other packages this library depends on might not allow for weights. Don't have time right now, but might come back to this in the future.
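
For models with closed-form estimators, weights are straightforward. As a minimal sketch (the helper ml_exp_weighted is hypothetical), the weighted exponential log-likelihood sum(w * log f(x; rate)) is maximised at rate = sum(w) / sum(w * x):

```r
# Sketch only: weighted maximum likelihood for the exponential model.
ml_exp_weighted <- function(x, w = rep(1, length(x))) {
  stopifnot(is.numeric(x), is.numeric(w), length(x) == length(w),
            all(x >= 0), all(w >= 0), sum(w) > 0)
  c(rate = sum(w) / sum(w * x))
}

x <- c(0.2, 1.5, 0.7, 3.1)
w <- c(1, 2, 1, 3)
ml_exp_weighted(x, w)
```

With integer weights this agrees with fitting the unweighted model to the data with each observation repeated w times, which gives a simple way to test weighted estimators.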

Fix return values in ml*** functions.

Some ml functions return vectors according to the docs. This is false.

References for every distribution.

Add references to papers or books for the density and the package the density can be found in.

[] dnorm
[] dlogis
[] dcauchy
[] dgumbel
[] dlaplace
[] dexp
[] dlomax
[] drayleigh
[] dgamma
[] dweibull
[] dlnorm
[] dinvgamma
[] dbetapr
[] dwald
[] dbeta
[] dkumar
[] dunif
[] dpower
[] dpareto

Remove the range option from plot and lines.

The range option is non-standard and should be replaced with the ordinary xlim.

Document minimum value on `mllgamma`.

The minimum value is 1, but it should be documented.