Different number of clusters of two converged dp model objects trained on same dataset? about dirichletprocess HOT 3 OPEN

dm13450 commented on August 31, 2024

from dirichletprocess.

Comments (3)

dm13450 commented on August 31, 2024

Hey, thanks for using the package.

dp$numberClusters is the number of clusters in the last iteration of the fitting process, so it will vary between fits: it's just one sample from the Dirichlet process posterior. As it's a Bayesian model, each data point has a distribution over likely cluster assignments rather than just one label.
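To look at that spread rather than a single draw, one option is to count the distinct labels in every stored iteration. This is only a minimal sketch: it assumes the fitted object keeps one label vector per iteration in dp$labelsChain (the field used later in this thread), and the `labels_chain` below is simulated, hypothetical data so the snippet runs without a fitted model.

```r
# Simulated stand-in for dp$labelsChain: one integer label vector per iteration.
# (Hypothetical data; with a fitted model, use labels_chain <- dp$labelsChain.)
set.seed(1)
n_points <- 50
labels_chain <- lapply(1:200, function(iter) {
  k <- sample(2:4, 1)                        # number of clusters varies by iteration
  sample(seq_len(k), n_points, replace = TRUE)
})

# Number of occupied clusters in each iteration
clusters_per_iter <- vapply(labels_chain, function(l) length(unique(l)), integer(1))

# Empirical distribution of the cluster count across iterations
cluster_count_dist <- table(clusters_per_iter) / length(clusters_per_iter)
print(cluster_count_dist)
```

In practice it may be sensible to drop early (burn-in) iterations before tabulating; how many to drop is a judgment call that this sketch does not address.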

Does that help?

wseis commented on August 31, 2024

Hi, thanks for your answer. So if I understood correctly, `dp$numberClusters` refers only to the last iteration of the MCMC fit. Given that, how can I determine the most likely cluster label for a given data point? In your blog post "Dirichlet Process Cluster Probabilities", you use

numClusters <- dp$numberClusters
clusterLabelProbs <- matrix(nrow=nrow(faithfulTrans), ncol=numClusters + 1)

to determine the required number of columns. As you can see, this won't work in my case, since the number of clusters varies strongly between iterations.

Does it make sense, given the n iterations of cluster labels stored in dp$labelsChain, to always extract the i-th element for the i-th data point and use the most frequently occurring cluster assignment?

Something like:

library(purrr) # provides map(), used here to extract the i-th element from each list entry
labels1 <- list()

for (i in seq_along(dp$clusterLabels)) {
  # Tabulate the labels sampled for data point i across all iterations,
  # then keep the most frequent one
  counts <- sort(table(unlist(map(dp$labelsChain, i))), decreasing = TRUE)
  labels1[[as.character(i)]] <- as.numeric(names(counts)[1])
}

unlist(labels1)
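The same per-point majority vote can also be written without purrr by stacking the chain into a matrix. This is a sketch, not the package's own method. It assumes every element of dp$labelsChain holds one label per data point, and that a given label refers to the same cluster across iterations. Labels can switch between MCMC iterations, so treat the result as approximate. The `labels_chain` here is small hypothetical data chosen so the result can be checked by hand.

```r
# Simulated stand-in for dp$labelsChain (hypothetical data):
# 3 iterations, 4 data points
labels_chain <- list(
  c(1, 1, 2, 2),
  c(1, 2, 2, 2),
  c(1, 1, 2, 3)
)

# Rows = iterations, columns = data points
label_matrix <- do.call(rbind, labels_chain)

# Most frequent label per data point (ties resolved by the smallest label)
modal_label <- apply(label_matrix, 2, function(col) {
  counts <- table(col)
  as.numeric(names(counts)[which.max(counts)])
})

# Share of iterations that agree with the modal label, as a rough confidence measure
modal_share <- vapply(seq_len(ncol(label_matrix)), function(j) {
  mean(label_matrix[, j] == modal_label[j])
}, numeric(1))

print(modal_label)
print(modal_share)
```

Reporting the agreement share alongside the label keeps some of the posterior uncertainty visible instead of collapsing it to a single hard assignment.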

Moreover, I have a similar problem with the ClusterLabelPredict function.

When I apply it twice to the same newData and the same fitted dp object, I get varying results. I assume this is also because each call is a random draw from the posterior. However, when I repeat the prediction many times and take the most frequent cluster label, the prediction stabilizes. Something like:

## Generating prediction for first two rows of the training set
prediction1 <- replicate(1000, list(ClusterLabelPredict(dp, newData = dp$data[1:2,])))
prediction2 <- replicate(1000, list(ClusterLabelPredict(dp, newData = dp$data[1:2,])))

## Extracting frequencies for newData item 2

t1 <- sort(table(unlist(map(map(prediction1, 1), 2))), decreasing = TRUE)[1]
t2 <- sort(table(unlist(map(map(prediction2, 1), 2))), decreasing = TRUE)[1]

Or is there any more convenient way to extract the most likely cluster label?

dm13450 commented on August 31, 2024

Yeah, everything above is correct: ClusterLabelPredict returns just one sample from the posterior, so you need to do multiple draws, like you have.

Unfortunately, nothing more convenient, but your code looks fine.
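For repeated use, the multiple-draw approach could be wrapped in a small helper. The sketch below is hypothetical and not part of dirichletprocess: `modal_prediction` and `mock_draw` are illustrative names, and the mock sampler stands in for a single posterior draw so the snippet runs without a fitted model. The commented-out call assumes, as the code in this thread does, that the first element of the ClusterLabelPredict result holds the predicted labels.

```r
# Hypothetical helper (not part of dirichletprocess): repeat a stochastic label
# predictor and return the most frequent label for each new data point.
modal_prediction <- function(draw_labels, n_draws = 1000) {
  # Each call to draw_labels() must return one label per new data point
  draws <- replicate(n_draws, draw_labels(), simplify = FALSE)
  draw_matrix <- do.call(rbind, draws)       # rows = draws, columns = data points
  apply(draw_matrix, 2, function(col) {
    counts <- table(col)
    as.numeric(names(counts)[which.max(counts)])
  })
}

# Mock sampler standing in for a single posterior draw (assumption for illustration).
# With a fitted model one would pass something like:
#   function() ClusterLabelPredict(dp, newData = dp$data[1:2, ])[[1]]
set.seed(42)
mock_draw <- function() {
  c(sample(1:2, 1, prob = c(0.9, 0.1)),      # point 1 mostly in cluster 1
    sample(1:2, 1, prob = c(0.2, 0.8)))      # point 2 mostly in cluster 2
}

stable_labels <- modal_prediction(mock_draw, n_draws = 1000)
print(stable_labels)
```

Passing the sampler as a function keeps the helper independent of the model object, so the same code works for any predictor that returns one label per point.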
