dm13450 / dirichletprocess

Build dirichletprocess objects for data analysis

Home Page: https://dm13450.github.io/dirichletprocess/

r bayesian r-package mcmc bayesian-inference bayesian-statistics dirichlet-process

dirichletprocess's Introduction

dirichletprocess


The dirichletprocess package provides tools for you to build custom Dirichlet process mixture models. You can use the pre-built Normal/Weibull/Beta distributions or create your own following the instructions in the vignette. In as little as four lines of code you can be modelling your data nonparametrically.

Installation

You can install the stable release of dirichletprocess from CRAN:

install.packages("dirichletprocess")

You can also install the development build of dirichletprocess from GitHub with:

# install.packages("devtools")
devtools::install_github("dm13450/dirichletprocess")

For a full guide to the package and its capabilities please consult the vignette:

browseVignettes(package = "dirichletprocess")

Examples

Density Estimation

Dirichlet processes can be used for nonparametric density estimation.

faithfulTransformed <- faithful$waiting - mean(faithful$waiting)
faithfulTransformed <- faithfulTransformed/sd(faithful$waiting)
dp <- DirichletProcessGaussian(faithfulTransformed)
dp <- Fit(dp, 100, progressBar = FALSE)
plot(dp)

Clustering

Dirichlet processes can also be used to cluster data based on their common distribution parameters.

faithfulTrans <- scale(faithful)
dpCluster <- DirichletProcessMvnormal(faithfulTrans)
dpCluster <- Fit(dpCluster, 2000, progressBar = FALSE)
plot(dpCluster)

For more detailed explanations and examples see the vignette.

Tutorials

I’ve written a number of tutorials:

and some case studies:

dirichletprocess's People

Contributors

dm13450, filippofiocchi, keesmulder


dirichletprocess's Issues

Sampling Posterior

Please check the PosteriorDraw(...) function. The first input argument has issues.

Error in y[which(clusterLabels == i), , drop = FALSE] : subscript out of bounds

knitr::opts_chunk$set(warning = FALSE, message = FALSE)
require(dirichletprocess)
require(ggplot2)
require(dplyr)
require(tidyr)
numIts <- 1000
faithfulTrans <- scale(faithful)
dp <- DirichletProcessMvnormal(faithfulTrans)
dp <- Fit(dp, numIts)
plot(dp)
dp2 <- DirichletProcessMvnormal(faithfulTrans, numInitialClusters = nrow(faithfulTrans))
dp2 <- Fit(dp2, numIts)
plot(dp2)

Running this code from the tutorial fails with Error in y[which(clusterLabels == i), , drop = FALSE] :
subscript out of bounds. It doesn't matter what value I set numInitialClusters to.

clustering high-dimensional data?

Hi @dm13450, I am trying to get dirichletprocess working for clustering a high-dimensional data set.
For example https://web.stanford.edu/~hastie/ElemStatLearn/datasets/zip.train.gz has 256 features.
Using dirichletprocess::DirichletProcessMvnormal would result in a 256 x 256 covariance matrix per cluster, right?
This results in VERY SLOW inference on my computer.
One way to speed that up would be to use a constrained covariance matrix, say spherical.
Is that something I should implement myself, or is there an existing/recommended way to accomplish this?
Thanks

Problem with convergence of the likelihoodChain

Hello,
thanks for the great package; it has been very useful, easy to use, and a great way to learn about DPMMs.

I'm using your package, but when I run multiple chains to check for convergence, even with a relatively simple data set such as iris, the Gelman plot says: 'Cannot compute Gelman & Rubin's diagnostic for any chain segments for variables Likelihood. This indicates convergence failure'.

I think this is due to the fact that the likelihoodChain estimate is sometimes -Inf. I tried running longer chains, but that didn't solve the problem.

Is there a way to solve this? For example, could I remove all the -Inf values from the chain (I don't know whether that is correct from a theoretical point of view)?

Thanks again for the great package!
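For context, here is a sketch of the kind of multi-chain check being described, assuming the coda package and two independently fitted objects on the same data. Dropping the non-finite log-likelihood values before computing the diagnostic is a pragmatic workaround, not a theoretically justified fix; whether discarding those draws is sound is exactly the open question in this issue.

```r
library(dirichletprocess)
library(coda)

faithfulTrans <- scale(faithful)

# Fit two independent chains on the same data.
dp1 <- Fit(DirichletProcessMvnormal(faithfulTrans), 1000, progressBar = FALSE)
dp2 <- Fit(DirichletProcessMvnormal(faithfulTrans), 1000, progressBar = FALSE)

# Keep only finite log-likelihood values before building the mcmc objects.
ll1 <- dp1$likelihoodChain[is.finite(dp1$likelihoodChain)]
ll2 <- dp2$likelihoodChain[is.finite(dp2$likelihoodChain)]

# gelman.diag() needs chains of equal length, so trim to the shorter one.
n <- min(length(ll1), length(ll2))
chains <- mcmc.list(mcmc(ll1[seq_len(n)]), mcmc(ll2[seq_len(n)]))
gelman.diag(chains)
```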

Only symmetric kernels supported for proposals in Metropolis-Hastings algorithm

Thank you for this very nice package. I was looking into the accompanying vignette to implement a new Dirichlet process mixture with a non-conjugate prior and I stumbled upon the MhParameterProposal() function (the example uses a random walk with a folded normal distribution).

It seems that only symmetric kernels are supported since the code in metropolis_hastings.R does not include a Hastings ratio. While this may be intentional, I could not see it mentioned in the vignette, the documentation or the examples. Perhaps worth adding a note to this effect?
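To illustrate the point being made, here is a generic Metropolis-Hastings acceptance step that includes the full Hastings correction for an asymmetric proposal. This is plain base R, not the package's internals; for a symmetric kernel the two q terms cancel, which is the simplification the code in metropolis_hastings.R relies on.

```r
# One Metropolis-Hastings step, accepting with probability
# min(1, [p(x') q(x | x')] / [p(x) q(x' | x)]).
# For a symmetric proposal the q terms cancel and only the target ratio remains.
mh_step <- function(x, log_target, propose, log_q) {
  x_new <- propose(x)
  log_ratio <- log_target(x_new) - log_target(x) +
    log_q(x, x_new) - log_q(x_new, x)   # Hastings correction for asymmetry
  if (log(runif(1)) < log_ratio) x_new else x
}

# Example: an asymmetric log-normal random walk targeting an Exp(1) density.
set.seed(1)
x <- 1
for (i in 1:1000) {
  x <- mh_step(x,
               log_target = function(z) dexp(z, rate = 1, log = TRUE),
               propose    = function(z) rlnorm(1, meanlog = log(z), sdlog = 0.5),
               log_q      = function(to, from) dlnorm(to, log(from), 0.5, log = TRUE))
}
```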

Suspected Inconsistency in PosteriorClusters function.

if (!missing(ind)) {

In this if statement, the two code paths imply very different versions of pointsPerCluster. When ind is specified, we get

pointsPerCluster <- dpobj$weightsChain[[ind]]

and when ind is not specified, we get

pointsPerCluster <- dpobj$pointsPerCluster

I believe the former is normalized while the latter is not. This becomes clearest when you compare the output of PosteriorClusters without an index, and with an index specified for the last sample

y <- rnorm(1000)
dp <- DirichletProcessGaussian(y)
dp <- Fit(dp, 50)
print(round(head(PosteriorClusters(dp)$weights), 3))
# [1] 0.981 0.014 0.003 0.001 0.000 0.000
print(round(head(PosteriorClusters(dp, 50)$weights), 3))
# [1] 0.311 0.000 0.000 0.000 0.068 0.036

What's the right thing to do here? In the first code path, should pointsPerCluster be set to weightsChain[[ind]] * N where N is the sample size?

wrong number of dimensions from LikelihoodDP()

Hi everybody,

first let me thank you for the very nice package!

In the LikelihoodDP() function code it is stated that the likelihoodValues object should be an "n x k matrix". However, while using the package, I noticed that this does not seem to be true. I used it with the following simple example, which I debugged line by line:

library(dirichletprocess)

set.seed(1406)
n <- 10
y <- rt(n, 3) + 2

g0Priors <- c(0, 1, 1, 1)
alphaPriors <- c(2, 4)

mdobj <- GaussianMixtureCreate(g0Priors)
dpobj <- DirichletProcessCreate(y, mdobj, alphaPriors)
dpobj <- Initialise(dpobj)

debugonce(Fit)
dp <- Fit(dpobj, 10)

My guess is that the Likelihood.normal() function handles the parameters inappropriately: their values appear to be replicated according to rules that are not the right ones.

Is there any chance you could check this and get back to me? I would like to use the package for my research, but I need to be sure that it provides the right numbers.

Thanks in advance for your answer.

Sergio

Error in update_concentration ?

Hello!

Thanks for this package and code, which have helped a lot in understanding DPMMs. I had a question, but in the meantime I found the answer, so this issue is no longer relevant.

The original question was:

In script 'update_concentration.R', line 22:
"new_alpha <- rgamma(1, postParams1, postParams2)"
The update is based on [1] that says (in sec. 2.2) that the gamma prior is parameterized with 'a' (shape) and 'b' (scale). However, the rgamma function is parameterized with shape and rate (=1/scale). Shouldn't line 22 be:
'new_alpha <- rgamma(1, shape = postParams1, scale = postParams2)' ?

The answer to that question is that there seems to be an error in the original paper, and the parameterization is indeed (shape, rate) (cf. here). This issue can be deleted, sorry!

[1] West, M. (1992). Hyperparameter estimation in Dirichlet process mixture models. ISDS Discussion Paper 92-A03, Duke University.
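For reference, base R's rgamma() is indeed parameterized by (shape, rate) in its positional arguments, which is easy to confirm empirically: a Gamma(shape = a, rate = b) variable has mean a/b, while scale = b would give mean a*b.

```r
set.seed(42)
a <- 2
b <- 4

# Positional arguments are (n, shape, rate): mean should be near a/b = 0.5.
mean(rgamma(1e5, a, b))

# Explicit rate gives the same parameterization.
mean(rgamma(1e5, shape = a, rate = b))

# Explicit scale is the other convention: mean should be near a*b = 8.
mean(rgamma(1e5, shape = a, scale = b))
```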

Indian buffets

Hello,
first of all thanks for this great package.

I wonder if there are any plans to implement Indian buffet processes (IBPs) as well, or if this is already supported somewhere?

Cheers,
Simon

Limit on the post_draws in Initialise

Hi there,

A question that came up while I was trying to build my own mixture model.

post_draws <- PosteriorDraw(dpObj$mixingDistribution, dpObj$data, 1000)
if (verbose)
  cat(paste("Accept Ratio: ",
            length(unique(c(post_draws[[1]])))/1000,
            "\n"))
dpObj$clusterParameters <- list(post_draws[[1]][, , 1000, drop = FALSE],
                                post_draws[[2]][, , 1000, drop = FALSE])

I am doing a non-conjugate mixture, so Initialise will use MetropolisHastings. On lines 52 and 53, dpObj$clusterParameters takes only the first two parameters drawn from post_draws. If I understand the function correctly, isn't that restricting the model to having only two parameters? I have quite a few parameters, so I kept hitting a wall until I pinned it down to these two lines.

Thanks.

dirichletprocess.rdb is corrupt

Error: package or namespace load failed for ‘dirichletprocess’ in get(method, envir = home):
lazy-load database '/home/nburghoorn/R/x86_64-pc-linux-gnu-library/3.6/dirichletprocess/R/dirichletprocess.rdb' is corrupt

is what I get when I:

library(dirichletprocess)

for both the stable and development installations.
Restarting/refreshing my RStudio session doesn't help.
What could be happening?

pre-allocation issue in PosteriorDraw.nonconjugate

Hi there,

Spotted this when going through the code.
In PosteriorDraw.nonconjugate there is a line as a pre-allocation of theta before the loop.

theta <- vector("list", length(mh_result))

Then later the length of the loop is essentially the length of mh_result$parameter_samples instead of mh_result.

for (i in seq_along(mh_result$parameter_samples)) {

It doesn't break anything since it's a list but I figured I should let you know.

plot dimensions when clustering multivariate data

Hello, I am Rhyeu in South Korea.

First of all, thanks for this awesome package!
I wanted to apply a Bayesian clustering method to some data and could not find another package to fit it, so I have been testing some functions here, and I have a question.

When I use this package to cluster multivariate data, which dimensions are used to draw the plot?

To be concrete, I tried DirichletProcessMvnormal and Fit on Fisher's iris data.
Which dimensions are used in the plot? The first and second principal components, or some other dimensions the function has found?

I have just started studying nonparametric Bayesian clustering.
I have already checked the vignettes and the reference manual but could not find where the plotted dimensions are defined; please pardon me if I missed it.

The dummy code is as follows.
Thanks!

data(iris)

iris_scale = scale(iris[,1:4])
dpCluster = DirichletProcessMvnormal(iris_scale)
dpCluster = Fit(dpCluster, 1000, progressBar = FALSE) 

plot(dpCluster)

constrain clusters to have common parameters?

hi @dm13450 first of all thanks for the great JSS article / vignette about dirichletprocess, which is super helpful. I am using it for teaching a CS class about unsupervised learning algorithms this semester.
I especially like how in the vignette it explains how to implement your own mixture models (Poisson example).
However, it was not clear whether it is possible to constrain a parameter to have a common value across clusters.
For example, I would like to implement something similar to mclust::Mclust(modelNames="E"), which enforces equal variance in univariate Gaussian mixture models. Is that possible?
I see that Likelihood.normal is defined as dnorm(x, theta[[1]], theta[[2]]); I would like to instead use dnorm(x, theta[[1]], common_variance_param), where common_variance_param is shared by all clusters and is also inferred from the data.
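One hypothetical way to get partway there, following the custom mixing distribution pattern from the vignette (the Poisson example): share the parameter through a variable outside theta, so every cluster's likelihood uses the same value. The class name normalCommonVar and this wiring are illustrative assumptions, not package API, and note that common_sd below is fixed rather than inferred; learning a shared parameter from the data would need an extra sampling step outside the per-cluster updates.

```r
library(dirichletprocess)

# Hypothetical sketch: a Likelihood method for a custom mixing-distribution
# class "normalCommonVar" in which only the mean is cluster-specific.
# common_sd is FIXED here, not learned from the data.
common_sd <- 1

Likelihood.normalCommonVar <- function(mdObj, x, theta) {
  # theta[[1]] holds the per-cluster means; the sd is shared by all clusters.
  dnorm(x, theta[[1]], common_sd)
}

# The remaining methods (PriorDraw, PosteriorDraw, Predictive, ...) would
# follow the vignette's custom-mixture example, treating only the mean as a
# cluster parameter.
```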

Issue with the LikelihoodDP() function

I noticed that in the latest release the LikelihoodDP() function resizes the likelihoodValues object using the code

dim(likelihoodValues) <- c(nrow(dpobj$data), dpobj$numberClusters)

but this shuffles the values in a strange way. I think it should be replaced by something like

likelihoodValues <- matrix(likelihoodValues, nrow = dpobj$n, ncol = dpobj$numberClusters, byrow = TRUE)

Looking forward to hearing what you think about it. Thanks!
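The difference between the two reshapes is easy to see in plain base R: `dim<-` keeps the vector's existing order and fills column-major, while matrix(..., byrow = TRUE) fills row by row, so if the likelihood values are laid out cluster-by-cluster within each observation, only one of the two lines the values up correctly.

```r
v <- 1:6

# dim<- reinterprets the vector column-major, without reordering it.
a <- v
dim(a) <- c(2, 3)
a
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# matrix(..., byrow = TRUE) fills row by row instead.
b <- matrix(v, nrow = 2, ncol = 3, byrow = TRUE)
b
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
```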

Restart the chain from where it stops

Hi Dean,
I was wondering whether the package has a function to restart the fitting of a DPMM object after it stops. For example, if I fit a DPMM with 3000 iterations and then want to run the chain for longer, I would restart the chain from where it stopped, i.e. from iteration 3000.

To do so I created a new function, Fit2, which is very similar to Fit but takes an already-run chain as input:

Fit2<- function(dpObj,its_start ,its_finish, updatePrior = FALSE, progressBar = interactive()) {
  
  if (progressBar){
    pb <- txtProgressBar(min=its_start, max=its_finish, width=50, char="-", style=3)
  }
  
  alphaChain <- dpObj$alphaChain
  likelihoodChain <- dpObj$likelihoodChain
  weightsChain <- dpObj$weightsChain
  clusterParametersChain <- dpObj$clusterParametersChain
  priorParametersChain <-   dpObj$priorParametersChain 
  labelsChain <-   dpObj$labelsChain 
  
  iteration <- its_start : its_finish
  for (i in iteration) {
    
    alphaChain[i] <- dpObj$alpha
    weightsChain[[i]] <- dpObj$pointsPerCluster / dpObj$n
    clusterParametersChain[[i]] <- dpObj$clusterParameters
    priorParametersChain[[i]] <- dpObj$mixingDistribution$priorParameters
    labelsChain[[i]] <- dpObj$clusterLabels
    
    
    likelihoodChain[i] <- sum(log(LikelihoodDP(dpObj)))
    
    dpObj <- ClusterComponentUpdate(dpObj)
    dpObj <- ClusterParameterUpdate(dpObj)
    dpObj <- UpdateAlpha(dpObj)
    
    if (updatePrior) {
      dpObj$mixingDistribution <- PriorParametersUpdate(dpObj$mixingDistribution,
                                                        dpObj$clusterParameters)
    }
    if (progressBar){
      setTxtProgressBar(pb, i)
    }
  }
  
  dpObj$weights <- dpObj$pointsPerCluster / dpObj$n
  dpObj$alphaChain <- alphaChain
  dpObj$likelihoodChain <- likelihoodChain
  dpObj$weightsChain <- weightsChain
  dpObj$clusterParametersChain <- clusterParametersChain
  dpObj$priorParametersChain <- priorParametersChain
  dpObj$labelsChain <- labelsChain
  
  if (progressBar) {
    close(pb)
  }
  return(dpObj)
}

I think it only works for Fit.default.
Let me know if there is another way already implemented in the package, thank you.

HDP Fit function returns an empty object with an error when accessed

Hi

I have tried to run the following code from the package

N <- 300
#Sample N random uniform U
U <- runif(N)
group1 <- matrix(nrow=N, ncol=2)
group2 <- matrix(nrow=N, ncol=2)
#Sampling from the mixture

m1 <- c(-2,-2)
m2 <- c(2,2)

for (i in 1:N) {
  if (U[i] < 0.3) {
    group1[i, ] <- mvtnorm::rmvnorm(1, m1)
    group2[i, ] <- mvtnorm::rmvnorm(1, m1)
  } else if (U[i] < 0.7) {
    group1[i, ] <- mvtnorm::rmvnorm(1, m2)
    group2[i, ] <- mvtnorm::rmvnorm(1, m1)
  } else {
    group1[i, ] <- mvtnorm::rmvnorm(1, m2)
    group2[i, ] <- mvtnorm::rmvnorm(1, m2)
  }
}

data_hdp <- list(group1, group2)

hdp_mvnorm <- DirichletProcessHierarchicalMvnormal2(dataList = data_hdp)

system.time(
hdp_mvnorm <- Fit(hdp_mvnorm, 500)
)

The algorithm takes some time to run, and accessing the returned object gives the following error:

hdp_mvnorm
Dirichlet process object run for 0 iterations.
Error in .rowNamesDF<-(x, value = value) : invalid 'row.names' length

Please let me know if this is something you can help with

Regards
Amine

Wrong number of dimensions in Likelihood.normal()

Hi there,

I have another question regarding the size of the 'theta' parameters in the Likelihood.normal() function: the object returned by that function appears to be a vector, but shouldn't it be a matrix with one column per cluster?

The same comment applies to Likelihood.beta().

Please, let me know if this makes sense. Thanks!

Initialization to one cluster with all data points.

Hello,

Thank you for the package it has been a great resource for me to learn about Dirichlet processes! :)

I have a quick question for you: if we initialize all the data points to one cluster, wouldn't it be difficult for points to leave that cluster, given that cluster-assignment probabilities are weighted by the number of points in each cluster? Consequently, would initializing with singleton clusters perhaps be useful?

Thanks again!
Matteo
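For what it's worth, the DirichletProcessMvnormal constructor exposes a numInitialClusters argument (it appears in the 'Sampling Posterior' issue above), so a singleton start can be tried directly; whether it actually improves mixing is the open question here. Note that the same issue reports a subscript error with this construction in some versions, so a recent package version may be needed.

```r
library(dirichletprocess)

faithfulTrans <- scale(faithful)

# Default initialization vs. one-cluster-per-point (singleton) initialization.
dpDefault    <- DirichletProcessMvnormal(faithfulTrans)
dpSingletons <- DirichletProcessMvnormal(faithfulTrans,
                                         numInitialClusters = nrow(faithfulTrans))

dpDefault    <- Fit(dpDefault, 500, progressBar = FALSE)
dpSingletons <- Fit(dpSingletons, 500, progressBar = FALSE)

# Compare how the number of occupied clusters evolves from each start.
sapply(dpDefault$labelsChain, function(l) length(unique(l)))
sapply(dpSingletons$labelsChain, function(l) length(unique(l)))
```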

prior choice for beta2

Hi,

I am trying to apply the Dirichlet process beta function to real data, but I get a strange result. My goal is to cluster a one-dimensional vector of percentages, hence bounded to [0, 1]. I use the Dirichlet process beta, but I get a strange spike near the boundary at 0. Would you have any advice on how to fit it another way?
Currently I am using:
dpobj = DirichletProcessBeta(y, maxY = 1, g0Priors = c(2,150), mhStep = c(0.25, 0.25), hyperPriorParameters = c(1, 1/150))

dpFit = Fit(dpObj = dpobj, 2000, updatePrior = TRUE)
plot(dpFit)


Different number of clusters of two converged dp model objects trained on same dataset?

Hi @dm13450, I used the dirichletprocess package twice on the same data set; I called the two objects dp and dp2 (see below).
I checked convergence following the approach on your blog. All parameters seemed to converge, with Gelman-Rubin diagnostics well below 1.1. The summary of the two objects shows a mean number of clusters of 5.26 and 5.22, respectively. However, when I print the number of clusters per object, dp has 5 clusters (which I would expect from the average of 5.26), but dp2 has only 2 clusters. Can you explain why that is?

print(dp)
Dirichlet process object run for 5000 iterations.

Mixing distribution mvnormal
Base measure parameters c(0, 0), c(1, 0, 0, 1), 2, 2
Alpha Prior parameters 2, 4
Conjugacy conjugate
Sample size 28

Mean number of clusters 5.26
Median alpha 0.84

dp2
Dirichlet process object run for 5000 iterations.

Mixing distribution mvnormal
Base measure parameters c(0, 0), c(1, 0, 0, 1), 2, 2
Alpha Prior parameters 2, 4
Conjugacy conjugate
Sample size 28

Mean number of clusters 5.22
Median alpha 0.83
dp$numberClusters
[1] 5
dp2$numberClusters
[1] 2

Potential bug in Predictive.mvnormal()

Hello,

I'm struggling with the code for the Predictive.mvnormal() function. In particular, I was wondering why, in the last if() statement that computes the gamma_contrib scalar, you used 'prod(vapply(seq_along(d) ...))' twice, when seq_along(d) is always equal to 1. Can you please check it?

Many thanks in advance for your response.
