alexpghayes / distributions3

Probability Distributions as S3 Objects

Home Page: https://alexpghayes.github.io/distributions3/

License: Other

R 99.65% TeX 0.35%

distributions3's Introduction

Hi, I'm Alex 👋

I'm a PhD candidate in the University of Wisconsin-Madison statistics program. My GitHub is a mixture of research code, #rstats ✨ contributions, and personal data analysis projects. I write long-form explainers on my blog, https://www.alexpghayes.com/.

Research software

  • fastadi performs self-tuning matrix completion via adaptive thresholding, often outperforming softImpute. See the paper for algorithmic and theoretical details. As part of some work on citation networks, I have also extended this algorithm to matrices where the entire upper triangle is observed.

  • aPPR helps you calculate approximate personalized pageranks from large graphs, including those that can only be queried via an API. aPPR additionally performs degree correction and regularization, allowing users to recover blocks from stochastic blockmodels. Read the paper.

  • vsp performs semi-parametric estimation of latent factors in random-dot product graphs by computing varimax rotations of the spectral embeddings of graphs. The resulting factors are sparse and interpretable. Read the paper.

  • fastRG samples random-dot product graphs much faster than naive sampling procedures and is especially useful when running simulation studies. See the paper for a description of the fastRG core algorithm.

#rstats

I am involved in a number of open source projects in the tidyverse and tidymodels orbits. I previously maintained the broom package, which currently has ~6 million downloads, and for my contributions I am an author on the tidyverse paper. I intermittently participate in the Stan and rOpenSci communities as well.

Teaching materials

Other projects

Please get in touch if...

  • you'd like to hire me for a research or data science for social good internship,
  • you want to discuss design of statistical modeling software,
  • you want to collaborate on a research project, or
  • you want to write an explainer together.

Outside of R, I'm a proficient Python user, and can pull together enough SQL, C++, and Julia to get things done.

I am responsive via email.

Last updated 2023-10-20.

distributions3's People

Contributors

1danjordan, alexpghayes, brunaw, dickoa, dkwhu, ellessenne, emilhvitfeldt, mnlang, ndrewwm, olivroy, paulnorthrop, rmtrane, zeileis

distributions3's Issues

Release distributions3 0.1.3

Prepare for release:

  • Check current CRAN check results
  • Polish NEWS
  • devtools::build_readme()
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md

Submit to CRAN:

  • usethis::use_version('patch')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()

pareto distribution

Love the idea of this package 👍.

I often need the Pareto distribution, for which I use actuar::*pareto2. Is it a barrier that this distribution isn't provided in base R? Or could it also be incorporated?
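
For context, the package's constructors are plain S3 objects, so a Pareto could in principle follow the same pattern. Below is a minimal sketch of a Pareto (Type I) constructor with pdf() and cdf() methods; the name, the (scale, shape) parameterisation, and the use of closed-form expressions rather than actuar are all assumptions for illustration, not an adopted API.

# Hypothetical sketch of a Pareto (Type I) distribution in the package's
# S3 style; name and parameterisation are assumptions.
Pareto <- function(scale = 1, shape = 1) {
  d <- list(scale = scale, shape = shape)
  class(d) <- c("Pareto", "distribution")
  d
}

pdf.Pareto <- function(d, x, ...) {
  # f(x) = shape * scale^shape / x^(shape + 1) for x >= scale, 0 otherwise
  ifelse(x >= d$scale, d$shape * d$scale^d$shape / x^(d$shape + 1), 0)
}

cdf.Pareto <- function(d, x, ...) {
  # F(x) = 1 - (scale / x)^shape for x >= scale, 0 otherwise
  ifelse(x >= d$scale, 1 - (d$scale / x)^d$shape, 0)
}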

Fisher F distribution details section of docs is for the Gamma distribution

I think this is just a copy/paste error:

#' In the following, let \eqn{X} be a Gamma random variable
#' with parameters
#' `shape` = \eqn{\alpha} and
#' `rate` = \eqn{\beta}.
#'
#' **Support**: \eqn{x \in (0, \infty)}
#'
#' **Mean**: \eqn{\frac{\alpha}{\beta}}
#'
#' **Variance**: \eqn{\frac{\alpha}{\beta^2}}
#'
#' **Probability density function (p.m.f)**:
#'
#' \deqn{
#' f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}
#' }{
#' f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}
#' }
#'
#' **Cumulative distribution function (c.d.f)**:
#'
#' \deqn{
#' f(x) = \frac{\Gamma(\alpha, \beta x)}{\Gamma{\alpha}}
#' }{
#' f(x) = \frac{\Gamma(\alpha, \beta x)}{\Gamma{\alpha}}
#' }
#'
#' **Moment generating function (m.g.f)**:
#'
#' \deqn{
#' E(e^{tX}) = \Big(\frac{\beta}{ \beta - t}\Big)^{\alpha}, \thinspace t < \beta
#' }{
#' E(e^(tX)) = \Big(\frac{\beta}{ \beta - t}\Big)^{\alpha}, \thinspace t < \beta

Refactoring plotting

Where I'm at so far:

  • I think having plot_pdf() and plot_cdf() is valuable even if there is some code repetition, because the names are the clearest option and beginners in my experience have struggled with function arguments (e.g. type = "cdf" vs type = "pdf"); we can take away that potential snag via specialized function names at little additional maintenance cost. (A minimal sketch follows this list.)

  • I don't think that having two distinct plotting systems is beginner friendly. I think it is important that there is one canonical way to do things in this package. I'm actually increasingly convinced that aliasing plot() and autoplot() is going to cause beginner confusion.

  • I don't particularly mind a ggplot2 hard dependency.
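
As referenced above, a minimal sketch of what a ggplot2-based plot_pdf() could look like for continuous distributions; the helper name, the grid construction via quantile(), and the axis labels are assumptions, not the package's implementation.

# Hypothetical sketch of a ggplot2-based plot_pdf(); relies on the
# package's pdf() and quantile() generics to build the plotting grid.
library(distributions3)
library(ggplot2)

plot_pdf_sketch <- function(d, n = 200) {
  grid <- seq(quantile(d, 0.001), quantile(d, 0.999), length.out = n)
  ggplot(data.frame(x = grid, density = pdf(d, grid)), aes(x, density)) +
    geom_line()
}

plot_pdf_sketch(Normal(0, 1))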

pdf/cdf/quantile should error if x is outside of domain.

A warning is thrown, but it comes from inside dbinom(), which is not ideal. Currently 0 is returned, but ideally this should be an error.

library(distributions)

pdf(Bernoulli(0.1), 0.5)
#> Warning in dbinom(x = x, size = 1, prob = d$p): non-integer x = 0.500000
#> [1] 0

Created on 2019-06-23 by the reprex package (v0.3.0)
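
One possible shape of the fix, sketched under the assumption that the method keeps delegating to dbinom(); the check and the message wording are illustrative only, not the package's code.

# Hypothetical sketch: validate the input before delegating to dbinom(),
# so the error comes from pdf() itself rather than from inside dbinom().
pdf.Bernoulli <- function(d, x, ...) {
  if (any(!(x %in% c(0, 1)) & !is.na(x))) {
    stop("`x` must lie in the support {0, 1} of the Bernoulli distribution.")
  }
  dbinom(x = x, size = 1, prob = d$p)
}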

Predicted sigma from lm/glm

In the new prodist() method we return, for a given model, the probability distribution associated with that model's point estimates. In the case of the linear model (or the Gaussian generalized linear model) it is uncontroversial what the point estimate for the location parameter mu is. However, there are two commonly used flavors for sigma: either the unbiased least squares estimate (division by n - k) or the biased maximum likelihood estimate (division by n).

In the prodist() methods for lm and glm objects I implemented the least squares estimate because that is what the summary() methods for both lm and glm objects report. However, I have now realized that, somewhat inconsistently, the logLik() methods for both objects use the maximum likelihood estimate. Hence, I wondered:

  • Is the least squares estimate the right choice?
  • Should it be the maximum likelihood estimate instead?
  • Or should we make this optional, say prodist(..., sigma = "ML") vs. sigma = "OLS"?
  • If the latter, what should be the default?

Alex @alexpghayes maybe you have an opinion on this? Or maybe someone else?

For illustration: In the summary() method we have the following.

m <- lm(dist ~ speed, data = cars)
c(summary(m)$sigma, summary(m)$sigma^2)
## [1]  15.37959 236.53169
summary(m)
## ...
## Residual standard error: 15.38 on 48 degrees of freedom
## [...]
summary(glm(dist ~ speed, data = cars))
## [...]
## (Dispersion parameter for gaussian family taken to be 236.5317)
## [...]

Hence, up to now we have:

pd <- prodist(m)
pd[1]
## "Normal distribution (mu = -1.849, sigma = 15.38)"

But with this we cannot replicate the log-likelihood:

logLik(m)
## 'log Lik.' -206.5784 (df=3)
log_likelihood(pd, cars$dist)
## [1] -206.599

To do so we would need the maximum likelihood estimate:

pd$sigma <- sqrt(mean(residuals(m)^2))
 pd[1]
## "Normal distribution (mu = -1.849, sigma = 15.07)" 
log_likelihood(pd, cars$dist)
## [1] -206.5784
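
If the optional route were taken, a minimal sketch of the third bullet could look like the following; the helper name, the argument name, and the default are assumptions for discussion only, not the implemented method.

# Hypothetical sketch of an optional sigma flavour for the lm method;
# argument name and default are assumptions for discussion.
prodist_lm_sketch <- function(object, sigma = c("OLS", "ML"), ...) {
  sigma <- match.arg(sigma)
  s <- if (sigma == "OLS") {
    summary(object)$sigma             # unbiased estimate, division by n - k
  } else {
    sqrt(mean(residuals(object)^2))   # ML estimate, division by n
  }
  distributions3::Normal(mu = fitted(object), sigma = s)
}

m <- lm(dist ~ speed, data = cars)
distributions3::log_likelihood(prodist_lm_sketch(m, sigma = "ML"), cars$dist)
## should match logLik(m), i.e. -206.5784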

Test failing for log_pdf of Generalised Pareto distribution

Hi,
I was running devtools::test() for #54 and noticed that a test for the Generalised Pareto distribution was failing on my system:

library(distributions3)
#> 
#> Attaching package: 'distributions3'
#> The following objects are masked from 'package:stats':
#> 
#>     Gamma, quantile
#> The following object is masked from 'package:grDevices':
#> 
#>     pdf

xi3 <- -1e-7
g3 <- GP(0, 1, xi3)
up <- -1 / xi3
xvec <- c(0, Inf, NA)
x3 <- c(up, xvec)

testthat::expect_equal(log_pdf(g3, x3), c(-Inf, 0, -Inf, NA))
#> Error: log_pdf(g3, x3) not equal to c(-Inf, 0, -Inf, NA).
#> 1/4 mismatches
#> [1] -1.5e+07 - -Inf == Inf

devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.0 (2020-04-24)
#>  os       macOS Catalina 10.15.5      
#>  system   x86_64, darwin17.0          
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       Europe/Stockholm            
#>  date     2020-06-26                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package        * version date       lib source        
#>  assertthat       0.2.1   2019-03-21 [1] CRAN (R 4.0.0)
#>  backports        1.1.8   2020-06-17 [1] CRAN (R 4.0.0)
#>  bayesplot        1.7.2   2020-05-28 [1] CRAN (R 4.0.0)
#>  callr            3.4.3   2020-03-28 [1] CRAN (R 4.0.0)
#>  cli              2.0.2   2020-02-28 [1] CRAN (R 4.0.0)
#>  colorspace       1.4-1   2019-03-18 [1] CRAN (R 4.0.0)
#>  crayon           1.3.4   2017-09-16 [1] CRAN (R 4.0.0)
#>  desc             1.2.0   2018-05-01 [1] CRAN (R 4.0.0)
#>  devtools         2.3.0   2020-04-10 [1] CRAN (R 4.0.0)
#>  digest           0.6.25  2020-02-23 [1] CRAN (R 4.0.0)
#>  distributions3 * 0.1.1   2020-06-26 [1] local         
#>  dplyr            1.0.0   2020-05-29 [1] CRAN (R 4.0.0)
#>  ellipsis         0.3.1   2020-05-15 [1] CRAN (R 4.0.0)
#>  evaluate         0.14    2019-05-28 [1] CRAN (R 4.0.0)
#>  fansi            0.4.1   2020-01-08 [1] CRAN (R 4.0.0)
#>  fs               1.4.1   2020-04-04 [1] CRAN (R 4.0.0)
#>  generics         0.0.2   2018-11-29 [1] CRAN (R 4.0.0)
#>  ggplot2          3.3.2   2020-06-19 [1] CRAN (R 4.0.0)
#>  ggridges         0.5.2   2020-01-12 [1] CRAN (R 4.0.0)
#>  glue             1.4.1   2020-05-13 [1] CRAN (R 4.0.0)
#>  gtable           0.3.0   2019-03-25 [1] CRAN (R 4.0.0)
#>  highr            0.8     2019-03-20 [1] CRAN (R 4.0.0)
#>  htmltools        0.5.0   2020-06-16 [1] CRAN (R 4.0.0)
#>  knitr            1.29    2020-06-23 [1] CRAN (R 4.0.0)
#>  lifecycle        0.2.0   2020-03-06 [1] CRAN (R 4.0.0)
#>  magrittr         1.5     2014-11-22 [1] CRAN (R 4.0.0)
#>  memoise          1.1.0   2017-04-21 [1] CRAN (R 4.0.0)
#>  munsell          0.5.0   2018-06-12 [1] CRAN (R 4.0.0)
#>  pillar           1.4.4   2020-05-05 [1] CRAN (R 4.0.0)
#>  pkgbuild         1.0.8   2020-05-07 [1] CRAN (R 4.0.0)
#>  pkgconfig        2.0.3   2019-09-22 [1] CRAN (R 4.0.0)
#>  pkgload          1.1.0   2020-05-29 [1] CRAN (R 4.0.0)
#>  plyr             1.8.6   2020-03-03 [1] CRAN (R 4.0.0)
#>  prettyunits      1.1.1   2020-01-24 [1] CRAN (R 4.0.0)
#>  processx         3.4.2   2020-02-09 [1] CRAN (R 4.0.0)
#>  ps               1.3.3   2020-05-08 [1] CRAN (R 4.0.0)
#>  purrr            0.3.4   2020-04-17 [1] CRAN (R 4.0.0)
#>  R6               2.4.1   2019-11-12 [1] CRAN (R 4.0.0)
#>  Rcpp             1.0.4.6 2020-04-09 [1] CRAN (R 4.0.0)
#>  remotes          2.1.1   2020-02-15 [1] CRAN (R 4.0.0)
#>  revdbayes        1.3.6   2019-12-02 [1] CRAN (R 4.0.0)
#>  rlang            0.4.6   2020-05-02 [1] CRAN (R 4.0.0)
#>  rmarkdown        2.3     2020-06-18 [1] CRAN (R 4.0.0)
#>  rprojroot        1.3-2   2018-01-03 [1] CRAN (R 4.0.0)
#>  scales           1.1.1   2020-05-11 [1] CRAN (R 4.0.0)
#>  sessioninfo      1.1.1   2018-11-05 [1] CRAN (R 4.0.0)
#>  stringi          1.4.6   2020-02-17 [1] CRAN (R 4.0.0)
#>  stringr          1.4.0   2019-02-10 [1] CRAN (R 4.0.0)
#>  testthat         2.3.2   2020-03-02 [1] CRAN (R 4.0.0)
#>  tibble           3.0.1   2020-04-20 [1] CRAN (R 4.0.0)
#>  tidyselect       1.1.0   2020-05-11 [1] CRAN (R 4.0.0)
#>  usethis          1.6.1   2020-04-29 [1] CRAN (R 4.0.0)
#>  vctrs            0.3.1   2020-06-05 [1] CRAN (R 4.0.0)
#>  withr            2.2.0   2020-04-20 [1] CRAN (R 4.0.0)
#>  xfun             0.15    2020-06-21 [1] CRAN (R 4.0.0)
#>  yaml             2.2.1   2020-02-01 [1] CRAN (R 4.0.0)
#> 
#> [1] /Users/ellessenne/R-lib
#> [2] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

Created on 2020-06-26 by the reprex package (v0.3.0)

Multivariate distributions

  • Probably need their own subclass of distributions
  • Ugh. What should the return be? Traditionally R has returned a matrix with no row or column names, which is really irritating if you want to put things into a tibble right after. I think a matrix with column names? (See the sketch after this list.) cc @1danjordan @EmilHvitfeldt

Distributions to start with

  • Categorical
  • Multinomial
  • Multivariate Normal
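
As referenced above, a sketch of a random() method that returns a matrix with column names; the class names, the column-name scheme, and the Cholesky-based sampler are assumptions for illustration only.

# Hypothetical sketch of a multivariate normal whose random() method
# returns a matrix with column names; names and structure are assumptions.
MultivariateNormal <- function(mu = c(0, 0), Sigma = diag(length(mu))) {
  d <- list(mu = mu, Sigma = Sigma)
  class(d) <- c("MultivariateNormal", "multivariate_distribution", "distribution")
  d
}

random.MultivariateNormal <- function(d, n = 1L, ...) {
  k <- length(d$mu)
  # transform standard normal draws by the Cholesky factor of Sigma
  z <- matrix(rnorm(n * k), nrow = n, ncol = k)
  out <- sweep(z %*% chol(d$Sigma), 2, d$mu, "+")
  colnames(out) <- paste0("x", seq_len(k))  # named columns drop into tibbles easily
  out
}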

Vignettes that might be useful for intro stat courses

One sample tests

  • Update the one sample z-test for a proportion vignette to include rejection regions and power calculations
    • Include note about rejection region not being inverted CI due to different standard errors
  • Update the one sample t-test vignette to include rejection regions and power calculations
  • One sample z confidence interval for a proportion
  • Chi squared tests for counts
    • Would be really great to get help with this one

Two sample tests

  • Two sample T-test (independent / equal variance)
    • Comment on $F = T^2$ relationship
  • Two sample T-test (Welch's)
    • Show use of power.t.test()

Paired tests

  • Paired Z-test
  • Paired T-test
  • Paired sign-test

ANOVA and linear regression

  • One-way ANOVA using the F distribution
    • Confirm results with both aov() and anova(lm())
    • Show power calculations with power.anova.test()
    • Diagnostic plots / assumption checking

Add is_discrete() method

So that we can factor it out of other code, in particular for plotting. It should use inherits() instead of the current class(x)[1] approach, to be robust to future changes to the class system.
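
A minimal sketch under that constraint (dispatch on the whole class vector via inherits() rather than class(x)[1]); the generic name follows the issue title, and the list of discrete classes is an assumption.

# Hypothetical sketch of is_discrete(); the set of discrete classes is an
# assumption, and inherits() checks the whole class vector.
is_discrete <- function(d, ...) UseMethod("is_discrete")

is_discrete.distribution <- function(d, ...) {
  discrete_classes <- c("Bernoulli", "Binomial", "Categorical", "Geometric",
                        "HyperGeometric", "Multinomial", "NegativeBinomial",
                        "Poisson")
  inherits(d, discrete_classes)
}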

Release distributions3 0.2.0

Prepare for release:

  • git pull
  • Check current CRAN check results
  • Check if any deprecation processes should be advanced, as described in Gradual deprecation
  • Polish NEWS
  • devtools::build_readme()
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • git push
  • Draft blog post

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • git push
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • git push
  • Finish blog post
  • Tweet
  • Add link to blog post in pkgdown news menu

HyperGeometric issues

  • The support in the documentation is wrong
  • The HyperGeometric() constructor allows for k > n + m (a sketch of a possible check follows this list)
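
As referenced in the list above, a sketch of a constructor check; the exact message wording is an assumption, not the package's current code.

# Hypothetical sketch of parameter validation in HyperGeometric().
HyperGeometric <- function(m, n, k) {
  if (k > m + n) {
    stop("`k` must be at most `m + n`: cannot draw more balls than the urn contains.")
  }
  d <- list(m = m, n = n, k = k)
  class(d) <- c("HyperGeometric", "distribution")
  d
}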

Rounding/nicer formatting for print methods

Down the road we might want to consider how to format numbers coming out of print methods for the distributions. In this example

#5 (comment)

we get something like

fit(normal(), rnorm(100))
#> normal distribution (mu = -0.0962394013206253, sigma = 1.09850673011021)

fit(exponential(), rexp(100))
#> Exponential distribution (rate = 0.827082214289625)

where I feel like displaying 15 digits might be a little overwhelming, especially if we are intending for this to be accessible to new users.
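
One possible direction, sketched for the Normal print method only; the digits default and the use of signif() are assumptions for illustration.

# Hypothetical sketch: round parameters with signif() before printing.
print.Normal <- function(x, digits = 4, ...) {
  cat(sprintf("Normal distribution (mu = %s, sigma = %s)\n",
              signif(x$mu, digits), signif(x$sigma, digits)))
  invisible(x)
}

print.Normal(list(mu = -0.0962394013206253, sigma = 1.09850673011021))
#> Normal distribution (mu = -0.09624, sigma = 1.099)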

Multinomial issues

  • print.Multinomial does not behave like we want it to
  • random.Multinomial, cdf.Multinomial, pdf.Multinomial, and log_pdf.Multinomial all have prob = d$size instead of prob = d$p (see the sketch after this list)
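
As referenced above, a sketch of the corrected argument mapping for the random method; the signature is an assumption and simply follows the stats::rmultinom() interface.

# Hypothetical sketch of the fix for the second bullet: pass prob = d$p
# (not prob = d$size) through to rmultinom().
random.Multinomial <- function(d, n = 1L, ...) {
  rmultinom(n = n, size = d$size, prob = d$p)
}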

Bug: weird things happen when you pass multiple parameters to distributions

library(distributions)
#> 
#> Attaching package: 'distributions'
#> The following objects are masked from 'package:stats':
#> 
#>     Gamma, quantile
#> The following object is masked from 'package:grDevices':
#> 
#>     pdf
n <- Normal(1:10, 1)

n
#> Normal distribution (mu = 1, sigma = 1) Normal distribution (mu = 2, sigma = 1) Normal distribution (mu = 3, sigma = 1) Normal distribution (mu = 4, sigma = 1) Normal distribution (mu = 5, sigma = 1) Normal distribution (mu = 6, sigma = 1) Normal distribution (mu = 7, sigma = 1) Normal distribution (mu = 8, sigma = 1) Normal distribution (mu = 9, sigma = 1) Normal distribution (mu = 10, sigma = 1)

sample(n, 2)
#> $sigma
#> [1] 1
#> 
#> $mu
#>  [1]  1  2  3  4  5  6  7  8  9 10

Created on 2019-06-27 by the reprex package (v0.3.0)
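
For discussion, a sketch of the behaviour one might expect instead: n draws for each of the ten parameter settings, with sigma recycled against mu. The helper name and the k x n return shape are assumptions, not a proposed API.

# Hypothetical sketch of expected behaviour: n draws per parameter
# setting, recycling sigma against mu.
random_normal_sketch <- function(d, n = 1L) {
  k <- max(length(d$mu), length(d$sigma))
  matrix(rnorm(k * n, mean = d$mu, sd = d$sigma), nrow = k, ncol = n)
}

random_normal_sketch(list(mu = 1:10, sigma = 1), 2)
# a 10 x 2 matrix: row i holds two draws from Normal(mu = i, sigma = 1)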

Use Distributions.jl test reference file for verification of distributions3

Distributions.jl uses a reference file to verify the correctness of their implementation. Surprisingly, they also have a practically complete R package, written with R6 classes, that they used to create this reference file. See the reference files for discrete distributions and continuous distributions.

We could leverage their reference file to use in verifying distributions3, in a similar manner to how they do so. This would require a bit of code to map each distribution in distributions3 to the entry in each JSON file. I think this would be more robust and (once written) accelerate development/introduction of new distributions because the test would essentially already be written!
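
A rough sketch of what that mapping code could look like, under explicit assumptions about the JSON layout: the file name, the field names, and the per-case list of evaluation points below are all invented for illustration.

# Hypothetical sketch of verifying against a Distributions.jl reference
# file; the file name and JSON field names are assumptions about the layout.
library(jsonlite)
library(distributions3)
library(testthat)

ref <- fromJSON("ref/normal.json", simplifyDataFrame = FALSE)

for (case in ref) {
  d <- Normal(mu = case$params$mu, sigma = case$params$sigma)
  expect_equal(pdf(d, case$points$x), case$points$pdf, tolerance = 1e-8)
  expect_equal(cdf(d, case$points$x), case$points$cdf, tolerance = 1e-8)
}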

Some distributions don't have default parameters

library(distributions)
#> 
#> Attaching package: 'distributions'
#> The following objects are masked from 'package:stats':
#> 
#>     Gamma, quantile
#> The following object is masked from 'package:grDevices':
#> 
#>     pdf
distributions::Bernoulli()
#> Bernoulli distribution (p = 0.5)
distributions::Beta()
#> Beta distribution (alpha = 1, beta = 1)
distributions::Binomial()
#> Error in distributions::Binomial(): argument "size" is missing, with no default
distributions::Cauchy()
#> Cauchy distribution (location = 0, scale = 1)
distributions::ChiSquare()
#> Error in distributions::ChiSquare(): argument "df" is missing, with no default
distributions::Exponential()
#> Exponential distribution (rate = 1)
distributions::FisherF()
#> Error in distributions::FisherF(): argument "df1" is missing, with no default
distributions::Gamma()
#> Error in distributions::Gamma(): argument "shape" is missing, with no default
distributions::Geometric()
#> Geometric distribution (p = 0.5)
distributions::HyperGeometric()
#> Error in distributions::HyperGeometric(): argument "m" is missing, with no default
distributions::Logistic()
#> Logistic distribution (location = 0, scale = 1)
distributions::LogNormal()
#> Lognormal distribution (log_mu = 0, log_sigma = 1)
distributions::Multinomial()
#> Error in distributions::Multinomial(): argument "size" is missing, with no default
distributions::NegativeBinomial()
#> Error in distributions::NegativeBinomial(): argument "size" is missing, with no default
distributions::Normal()
#> Normal distribution (mu = 0, sigma = 1)
distributions::Poisson()
#> Error in distributions::Poisson(): argument "lambda" is missing, with no default
distributions::StudentsT()
#> Error in distributions::StudentsT(): argument "df" is missing, with no default
distributions::Tukey()
#> Error in distributions::Tukey(): argument "nmeans" is missing, with no default
distributions::Uniform()
#> Continuous Uniform distribution (a = 0, b = 1)
distributions::Weibull()
#> Error in distributions::Weibull(): argument "shape" is missing, with no default

Created on 2019-06-27 by the reprex package (v0.3.0)

Is this intentional? Do we want to provide a sensible default to ensure consistent behavior?

Generic functions for the moments

Similar generics exist in Distributions.jl

n <- Normal(5, 10)
mean(n)
variance(n)
skewness(n)
kurtosis(n)

These would be very easy to add as well.
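
A sketch of what adding these could look like, shown for the Normal distribution only; the generic names follow the snippet above, and kurtosis() is taken here to mean excess kurtosis (an assumption).

# Hypothetical sketch of moment generics plus Normal methods; mean() is
# already generic in base R, the other generics are defined here.
variance <- function(d, ...) UseMethod("variance")
skewness <- function(d, ...) UseMethod("skewness")
kurtosis <- function(d, ...) UseMethod("kurtosis")

mean.Normal     <- function(x, ...) x$mu
variance.Normal <- function(d, ...) d$sigma^2
skewness.Normal <- function(d, ...) 0
kurtosis.Normal <- function(d, ...) 0  # excess kurtosis of the normal is 0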

prodist() followup

In #83 @zeileis implemented a generic prodist(). I have two followup items to discuss about the implementation:

  1. An informatively named alias for prodist(), possibly distributional_estimates() or extract_distributions() or something along these lines. Currently I like distributional_estimates() the most but do not love it.
  2. Documentation. In particular, my understanding is that prodist() extracts distributional estimates of various population estimands from model objects. I think it would be good to (a) distinguish between a distributional estimator of estimands like E[Y|X] and "the distribution of a data point" in the documentation, especially since this is something likely to confuse students, and (b) to clarify in the documentation for each prodist() method what exactly the estimand is. Currently it is very hard to track down what distributional estimators are estimating without reading the source for each method and having a solid grasp of the various predict() and forecast() functions used in the implementation.

quantile() masks stats::quantile()

This is no surprise, as it is even mentioned on the help page.

Problem: geom_qq_line() will not work while distributions3 is loaded. This is an issue if you (like me...) would like to use distributions3 and ggplot2 for teaching without having to touch the base R plotting functions, and are forced to cover QQ-plots... An ad-hoc fix is to use geom_abline() instead of geom_qq_line(), but it adds an unwanted and unnecessary extra level of complexity.

I have no fix for this, but think it would be worth it to find a way around it down the line, if possible.
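
For reference, the ad-hoc workaround mentioned above, sketched with an explicit stats::quantile() call so the masking does not matter; the quartile-based slope and intercept mirror what qqline() does by default.

# Sketch of the geom_abline() workaround: compute the reference line from
# the sample quartiles, calling stats::quantile() explicitly.
library(ggplot2)

x <- rnorm(100)
q <- stats::quantile(x, c(0.25, 0.75))
z <- qnorm(c(0.25, 0.75))
slope <- diff(q) / diff(z)
intercept <- q[1] - slope * z[1]

ggplot(data.frame(sample = x), aes(sample = sample)) +
  geom_qq() +
  geom_abline(intercept = intercept, slope = slope)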

Interest in extreme value distributions?

I recently learnt about the evd package, which provides extreme value distributions. Would you be interested in adding a few of the key distributions from it, e.g. Fréchet and Gumbel?
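
For discussion, here is a sketch of what one of them (Gumbel) could look like in the package's S3 style; the parameterisation (location mu, scale sigma) and the use of closed-form expressions instead of evd are assumptions.

# Hypothetical sketch of a Gumbel distribution; parameterisation and the
# closed-form pdf/cdf (rather than calling evd) are assumptions.
Gumbel <- function(mu = 0, sigma = 1) {
  d <- list(mu = mu, sigma = sigma)
  class(d) <- c("Gumbel", "distribution")
  d
}

pdf.Gumbel <- function(d, x, ...) {
  z <- (x - d$mu) / d$sigma
  exp(-z - exp(-z)) / d$sigma
}

cdf.Gumbel <- function(d, x, ...) {
  exp(-exp(-(x - d$mu) / d$sigma))
}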

plotting generic

I was thinking that adding a plotting generic would be beneficial. My initial idea would be to include both the pdf and the cdf side by side.

library(distributions)
#> 
#> Attaching package: 'distributions'
#> The following objects are masked from 'package:stats':
#> 
#>     binomial, poisson, quantile
#> The following object is masked from 'package:grDevices':
#> 
#>     pdf
#> The following objects are masked from 'package:base':
#> 
#>     beta, gamma
library(glue)

plot.normal <- function(p) {
  range <- p$sigma * 4 * c(-1, 1) + p$mu
  x <- seq(range[1], range[2], length.out = 100)
  
  par(mfrow = c(1, 2)) 
  plot(x = x,
       y = pdf(p, x),
       type = "l", ylab = "Proberbility Density", 
       main = glue("normal pdf (mu = {p$mu}, sigma = {p$sigma})"))
  plot(x = x,
       y = cdf(p, x),
       type = "l", ylab = "Proberbility", 
       main = glue("normal cdf (mu = {p$mu}, sigma = {p$sigma})"))
  par(mfrow = c(1, 1)) 
}

plot(normal(2, 5))

Created on 2019-05-14 by the reprex package (v0.2.1)

Name conventions

  • Distributions with capitalized names?
  • _stat / _obs suffix for observed values of statistics

Try to follow an upper/lower-case convention that makes things look as much like the math in textbooks as possible.

Support continuous ranked probability score (CRPS)?

The continuous ranked probability score (CRPS) is a popular and strictly proper scoring rule for probabilistic models and forecasts which has some advantages over the so-called log-score (equivalent to the log-density aka log-likelihood). For example, it is bounded as the precision increases (i.e., variance decreases).

It would be nice to have crps() methods for the distribution objects in the package, analogous to the log_pdf() methods. The implementation would be relatively straightforward using the nice scoringRules package (see also their JSS paper). For example, for the normal distribution:

crps.Normal <- function(y, x, drop = TRUE, elementwise = NULL, ...) {
  stopifnot(requireNamespace("scoringRules"))
  FUN <- function(at, d) scoringRules::crps_norm(y = at, mean = d$mu, sd = d$sigma)
  apply_dpqr(d = y, FUN = FUN, at = x, type = "crps", drop = drop, elementwise = elementwise)
}

It would be enough to include scoringRules in the "Suggests" as the S3 methods can be registered conditionally. scoringRules does not support all of the distributions in distributions3 but most of them. Should I go ahead and prepare a PR for this? (If you would prefer not to have it in distributions3 I could also put this into the topmodels package.)
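
To make the conditional-registration point concrete, here is a sketch of the matching generic and an .onLoad() hook; the generic name follows the proposal, and the registration mechanics shown are an assumption rather than the eventual implementation.

# Hypothetical sketch: define the crps() generic and register the Normal
# method above only when scoringRules is available, keeping it in Suggests.
crps <- function(y, x, ...) UseMethod("crps")

.onLoad <- function(libname, pkgname) {
  if (requireNamespace("scoringRules", quietly = TRUE)) {
    registerS3method("crps", "Normal", crps.Normal)
  }
}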

Release distributions3 0.2.1

Prepare for release:

  • git pull
  • Check current CRAN check results
  • Polish NEWS
  • devtools::build_readme()
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • git push

Submit to CRAN:

  • usethis::use_version('patch')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • git push
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • git push

Release distributions3 0.1.2

Prepare for release:

Submit to CRAN:

  • usethis::use_version('patch')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()

pdf.bernoulli isn't vectorised

See below:

b <- bernoulli()
x <- c(0,0,0,1)
pdf(b, x)

#> [1] 0.5
#> Warning message:
#> In if (x == 0) 1 - d$p else d$p :
#>   the condition has length > 1 and only the first element will be used

A simple enough fix is to use dbinom() with size = 1 instead. I can include this fix in my existing pull request.
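
For completeness, the fix described above in sketch form; pdf.bernoulli matches the (old) lowercase naming used in the reprex.

# Sketch of the fix: delegate to the already-vectorised dbinom() with
# size = 1 instead of branching on a scalar condition.
pdf.bernoulli <- function(d, x, ...) {
  dbinom(x = x, size = 1, prob = d$p)
}

pdf.bernoulli(list(p = 0.5), c(0, 0, 0, 1))
#> [1] 0.5 0.5 0.5 0.5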

Revised CRAN release

Are there plans for a CRAN release that will include some of the bug fixes that have been implemented? If it matters, I'm thinking of #51 specifically. I have a package, imaginator, which uses distributions3 and which I'm preparing for a CRAN update.

Figure out how to handle vector inputs to distribution constructors appropriately

Suppose you want to simulate from a logistic regression model. You calculate the linear predictor, and now you need n (the number of observations) different Bernoulli objects, and want to draw a single sample from each of them. This is one place where the base R approach shines. Should we vectorize object construction? Then we lose type stability, because the type of the output (distribution vs. list) depends on the length of the input.

Feedback and possible solutions welcome.
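
To make the trade-off concrete, here is the use case in code; the base R line works today, while the commented distributions3 line shows the vectorised construction under discussion (an assumption, not current behaviour).

# The logistic-regression simulation described above: base R vectorises
# naturally; the commented line sketches the vectorised object approach.
set.seed(1)
x <- rnorm(100)
p <- plogis(-0.5 + 2 * x)                       # inverse logit of the linear predictor

y_base <- rbinom(length(p), size = 1, prob = p) # base R: one draw per element of p

## hypothetical vectorised construction + one draw from each distribution:
## y_obj <- random(Bernoulli(p), 1)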

Pretty print of distribution object

Hi Alex,

I wanted to thank you for this package, it's really well designed.

I wanted to know if it's possible to change the way we print distribution objects.
For example, at the moment, we have this behavior:

> Binomial(1, 0.5)
Binomial distribution (size = 1, p = 0.5)> 

But I was expecting this

> Binomial(1, 0.5)
Binomial distribution (size = 1, p = 0.5)
> 

Is it a design choice? Can we change it? I can work on a PR if you think that would be better.
Thanks again for your work on this.
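
For reference, a sketch of the kind of change being asked for, ending the printed representation with a newline; the exact formatting is an assumption.

# Hypothetical sketch: terminate the printed line with "\n" so the next
# prompt starts on its own line.
print.Binomial <- function(x, ...) {
  cat(sprintf("Binomial distribution (size = %s, p = %s)\n", x$size, x$p))
  invisible(x)
}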
