jonasrieger / ldaprototype

Determine a Prototype from a number of runs of Latent Dirichlet Allocation.

License: GNU General Public License v3.0

Languages: R 96.50%, TeX 3.40%, Shell 0.10%
Topics: topicmodeling, topicmodelling, lda, topic-models, topic-model, topic-similarities, text-mining, textdata, latent-dirichlet-allocation, modelselection, model-selection, reliability

ldaprototype's Introduction

ldaPrototype


Prototype of Multiple Latent Dirichlet Allocation Runs

Determine a prototype from a number of runs of Latent Dirichlet Allocation (LDA): the package selects the LDA run with the highest mean pairwise similarity to all other runs, where similarity is measured by S-CLOP (Similarity of multiple sets by Clustering with Local Pruning). LDA runs are specified by their topic assignments, which lead to estimators of the distribution parameters. Repeated runs yield different results; we counter this by choosing the most representative LDA run as the prototype.

Citation

Please cite the JOSS paper using the BibTeX entry

@article{<placeholder>,
    title = {{ldaPrototype}: A method in {R} to get a Prototype of multiple Latent Dirichlet Allocations},
    author = {Jonas Rieger},
    journal = {Journal of Open Source Software},
    year = {2020},
    volume = {5},
    number = {51},
    pages = {2181},
    doi = {10.21105/joss.02181},
    url = {https://doi.org/10.21105/joss.02181}
  }

which is also obtained by the call citation("ldaPrototype").

References (related to the methodology)

  • Rieger, J., Jentsch, C. & Rahnenführer, J.: LDAPrototype: A Model Selection Algorithm to Improve Reliability of Latent Dirichlet Allocation. preprint
  • Rieger, J. (2020). ldaPrototype: A method in R to get a Prototype of multiple Latent Dirichlet Allocations. Journal of Open Source Software, 5(51), 2181.
  • Rieger, J., Rahnenführer, J. & Jentsch, C. (2020). Improving Latent Dirichlet Allocation: On Reliability of the Novel Method LDAPrototype. Natural Language Processing and Information Systems, NLDB 2020. LNCS 12089, pp. 118-125.

Please also have a look at this short overview of related software for topic modeling in R:

Related Software

  • tm is useful for preprocessing text data.
  • lda offers a fast implementation of the Latent Dirichlet Allocation and is used by ldaPrototype.
  • quanteda is a framework for "Quantitative Analysis of Textual Data".
  • stm is a framework for Structural Topic Models.
  • tosca is a framework for statistical methods in content analysis including visualizations and validation techniques. It is also useful for managing and manipulating text data to a structure requested by ldaPrototype.
  • topicmodels is another framework for various topic models based on Latent Dirichlet Allocation and Correlated Topic Models.
  • ldatuning is a framework for finding the optimal number of topics using various metrics.

Contribution

This R package is licensed under the GPLv3. For bug reports (lack of documentation, misleading or wrong documentation, unexpected behaviour, ...) and feature requests please use the issue tracker. Pull requests are welcome and will be included at the discretion of the author.

Installation

install.packages("ldaPrototype")

For the development version use devtools:

devtools::install_github("JonasRieger/ldaPrototype")

(Quick Start) Example

Load the package and the example dataset from Reuters consisting of 91 articles; tosca::LDAprep can be used to transform text data into the format required by ldaPrototype.

library("ldaPrototype")
data(reuters_docs)
data(reuters_vocab)

Run the shortcut function to create an LDAPrototype object. It contains the LDAPrototype determined from 4 LDA runs (with specified seeds) with 10 topics each. The LDA selected by the algorithm can be retrieved using getPrototype or getLDA.

res = LDAPrototype(docs = reuters_docs, vocabLDA = reuters_vocab, n = 4, K = 10, seeds = 1:4)
proto = getPrototype(res) #= getLDA(res)

The same result can also be achieved by executing the following lines of code in several steps, which can be useful for interim evaluations.

reps = LDARep(docs = reuters_docs, vocab = reuters_vocab,
  n = 4, K = 10, seeds = 1:4)
topics = mergeTopics(reps, vocab = reuters_vocab)
jacc = jaccardTopics(topics)
sclop = SCLOP.pairwise(jacc)
res2 = getPrototype(reps, sclop = sclop)

proto2 = getPrototype(res2) #= getLDA(res2)

identical(res, res2)

There is also the option to use similarity measures other than the Jaccard coefficient. Currently, the measures cosine similarity (cosineTopics), Jensen-Shannon divergence (jsTopics) and rank-biased overlap (rboTopics) are implemented in addition to the standard Jaccard coefficient (jaccardTopics).
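Any of these functions can be swapped into the step-by-step pipeline above in place of jaccardTopics. The following lines are a minimal sketch (not taken from the package documentation); the values for k and p in rboTopics are illustrative, not package defaults:

sims = jsTopics(topics)                     # Jensen-Shannon divergence
# sims = cosineTopics(topics)               # cosine similarity
# sims = rboTopics(topics, k = 12, p = 0.9) # rank-biased overlap; k and p are example values
sclopAlt = SCLOP.pairwise(sims)
resAlt = getPrototype(reps, sclop = sclopAlt)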

To get an overview of the workflow, the associated functions and getters for each type of object, the following call is helpful:

?`ldaPrototype-package`

(Slightly more detailed) Example

Similar to the quick start example, the shortcut of one single call is again compared with the step-by-step procedure. We model 5 LDAs with K = 12 topics, hyperparameters alpha = eta = 0.1 and seeds 1:5. We want to calculate the log likelihoods for the 20 iterations after 5 burn-in iterations and topic similarities should be based on atLeast = 3 words (see Step 3 below). In addition, we want to keep all interim calculations, which would be discarded by default to save memory space.

res = LDAPrototype(docs = reuters_docs, vocabLDA = reuters_vocab,
  n = 5, K = 12, alpha = 0.1, eta = 0.1, compute.log.likelihood = TRUE,
  burnin = 5, num.iterations = 20, atLeast = 3, seeds = 1:5,
  keepLDAs = TRUE, keepSims = TRUE, keepTopics = TRUE)

Based on res we can have a look at several getter functions:

getID(res)
getPrototypeID(res)

getParam(res)
getParam(getLDA(res))

getLDA(res, all = TRUE)
getLDA(res)

est = getEstimators(getLDA(res))
est$phi[,1:3]
est$theta[,1:3]
getLog.likelihoods(getLDA(res))

getSCLOP(res)
getSimilarity(res)[1:5, 1:5]
tosca::topWords(getTopics(getLDA(res)), 5)

Step 1: LDA Replications

In the first step we simply run the LDA procedure five times with the given parameters. This can also be done with the support of batchtools by using LDABatch instead of LDARep, or in parallel via parallelMap by setting the pm.backend and (optionally) ncpus argument(s); both variants are sketched after the code below.

reps = LDARep(docs = reuters_docs, vocab = reuters_vocab,
  n = 5, K = 12, alpha = 0.1, eta = 0.1, compute.log.likelihood = TRUE,
  burnin = 5, num.iterations = 20, seeds = 1:5)
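The following lines sketch the two variants mentioned above. The argument names pm.backend and ncpus are those from the text; the backend "socket" and ncpus = 2 are example choices, and LDABatch is assumed to take the same core arguments as LDARep:

# parallelMap variant: run the replications on 2 local socket workers (illustrative values)
repsParallel = LDARep(docs = reuters_docs, vocab = reuters_vocab,
  n = 5, K = 12, seeds = 1:5, pm.backend = "socket", ncpus = 2)

# batchtools variant: replications are submitted as batchtools jobs
repsBatch = LDABatch(docs = reuters_docs, vocab = reuters_vocab,
  n = 5, K = 12, seeds = 1:5)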

Step 2: Merging Topic Matrices of Replications

The topic matrices of all replications are merged and reduced to the vocabulary given in vocab. By default, the vocabulary of the first topic matrix is used, which is a simplification for the common case that all LDAs share the same vocabulary.

topics = mergeTopics(reps, vocab = reuters_vocab)
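For a quick look at the result, assuming (as sketched here) that the merged object is a matrix with the vocabulary in rows and the topics of all runs in columns:

dim(topics)       # number of words x (n runs * K) topics
topics[1:5, 1:3]  # word counts of the first five words in the first three topics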

Step 3: Topic Similarities

We use the merged topic matrix to calculate pairwise topic similarities using the Jaccard coefficient, with parameters adjusting which words are considered. A word is taken as relevant for a topic if its count passes the thresholds given by limit.rel and limit.abs. A word is considered for the calculation of similarities if it is relevant for the topic or if it belongs to the (atLeast =) 3 most common words in the corresponding topic. Alternatively, the similarities can also be calculated using the cosine similarity (cosineTopics), the Jensen-Shannon divergence (jsTopics - parameter epsilon to ensure computability) or rank-biased overlap (rboTopics - parameter k for the maximum depth of evaluation and p as weighting parameter).

jacc = jaccardTopics(topics, limit.rel = 1/500, limit.abs = 10, atLeast = 3)
getSimilarity(jacc)[1:3, 1:3]

We can check the number of relevant and considered words using the corresponding getter functions. The difference between n1 and n2 can become larger than (atLeast =) 3 if there are ties in the word counts, which is negligible for large sample sizes.

n1 = getRelevantWords(jacc)
n2 = getConsideredWords(jacc)
(n2-n1)[n2-n1 != 0]

Step 3.1: Representation of Topic Similarities as Dendrogram

It is possible to represent the calculated pairwise topic similarities as a dendrogram using dendTopics and the related plot options.

dend = dendTopics(jacc)
plot(dend)

The S-CLOP algorithm results in a pruning state of the dendrogram, which can be retrieved by calling pruneSCLOP. By default, each topic is colored according to the LDA run it belongs to; the cluster membership can also be visualized by colors or by vertical lines with freely chosen parameters.

pruned = pruneSCLOP(dend)
plot(dend, pruned)
plot(dend, pruning = pruned, pruning.par = list(type = "both", lty = 1, lwd = 2, col = "red"))

Step 4: Pairwise LDA Model Similarities (S-CLOP)

To determine the LDAPrototype, the pairwise S-CLOP similarities of the 5 LDA runs are needed.

sclop = SCLOP.pairwise(jacc)

Step 5: Determine LDAPrototype

In the last step the LDAPrototype itself is determined by maximizing the mean pairwise S-CLOP per LDA.

res2 = getPrototype(reps, sclop = sclop)
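The selection rule can also be illustrated by hand. This is only a sketch, not the package's internal code; it assumes sclop is a (possibly triangular) matrix of pairwise S-CLOP values whose diagonal is to be ignored:

m = sclop
m = pmax(m, t(m), na.rm = TRUE)       # symmetrize in case only one triangle is filled
diag(m) = NA                          # ignore self-comparisons
which.max(rowMeans(m, na.rm = TRUE))  # index of the run with the highest mean pairwise S-CLOP,
                                      # which should correspond to getPrototypeID(res2)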

There are several possibilities for using shortcut functions to summarize steps of the procedure. For example, we can determine the LDAPrototype directly after Step 1:

res3 = getPrototype(reps, atLeast = 3)


ldaprototype's Issues

Readme Update

Following the code in the README breaks down at Step 3.1 because sims isn't an object that has been computed at that point. (By the way, I enjoyed the README; it was nice that you highlighted the aggregate function and then broke down the components.)

.Random.seed not found using the RNG L'Ecuyer-CMRG

> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

Random number generation:
 RNG:     L'Ecuyer-CMRG 
 Normal:  Inversion 
 Sample:  Rejection 
 
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ldaPrototype_0.3.0

loaded via a namespace (and not attached):
 [1] parallelMap_1.5.0 Rcpp_1.0.6        NLP_0.2-1         tosca_0.3-1       pillar_1.6.1      compiler_4.1.0    prettyunits_1.1.1
 [8] viridis_0.6.1     tools_4.1.0       progress_1.2.2    dendextend_1.15.1 lubridate_1.7.10  lifecycle_1.0.0   tibble_3.1.2     
[15] gtable_0.3.0      checkmate_2.0.0   viridisLite_0.4.0 pkgconfig_2.0.3   rlang_0.4.11      DBI_1.1.1         parallel_4.1.0   
[22] gridExtra_2.3     lda_1.4.2         xml2_1.3.2        dplyr_1.0.7       generics_0.1.0    vctrs_0.3.8       fs_1.5.0         
[29] hms_1.1.0         grid_4.1.0        tidyselect_1.1.1  glue_1.4.2        data.table_1.14.0 R6_2.5.0          fansi_0.5.0      
[36] ggplot2_3.3.4     purrr_0.3.4       magrittr_2.0.1    BBmisc_1.11       backports_1.2.1   scales_1.1.1      ellipsis_0.3.2   
[43] assertthat_0.2.1  colorspace_2.0-1  utf8_1.2.1        munsell_0.5.0     slam_0.1-48       tm_0.7-8          crayon_1.4.1

Under the setting given above, the following error message appears when e.g. LDARep is executed:
Error in (function (fun, ..., more.args = list(), simplify = FALSE, use.names = FALSE, : object '.Random.seed' not found.

There is a workaround calling set.seed before. The function itself should actually take care of this case by calling the following code:

if (!exists(".Random.seed", envir = globalenv())) {
  runif(1)  # touch the RNG so that .Random.seed is created in the global environment
}
oldseed = .Random.seed       # remember the current RNG state
seeds = sample(9999999, n)   # draw the seeds for the n LDA runs
.Random.seed <<- oldseed     # restore the previous RNG state

I don't currently know exactly why this isn't working.
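For reference, the workaround from above in code form, here sketched with the example data from the README (the seed value is arbitrary; it only ensures that .Random.seed exists before LDARep is called):

set.seed(1)  # any value works
reps = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 5, K = 12)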

sprintf warning message using LDARep under Ubuntu

> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

Random number generation:
 RNG:     L'Ecuyer-CMRG 
 Normal:  Inversion 
 Sample:  Rejection 
 
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ldaPrototype_0.3.0

loaded via a namespace (and not attached):
 [1] parallelMap_1.5.0 Rcpp_1.0.6        NLP_0.2-1         tosca_0.3-1       pillar_1.6.1      compiler_4.1.0    prettyunits_1.1.1
 [8] viridis_0.6.1     tools_4.1.0       progress_1.2.2    dendextend_1.15.1 lubridate_1.7.10  lifecycle_1.0.0   tibble_3.1.2     
[15] gtable_0.3.0      checkmate_2.0.0   viridisLite_0.4.0 pkgconfig_2.0.3   rlang_0.4.11      DBI_1.1.1         parallel_4.1.0   
[22] gridExtra_2.3     lda_1.4.2         xml2_1.3.2        dplyr_1.0.7       generics_0.1.0    vctrs_0.3.8       fs_1.5.0         
[29] hms_1.1.0         grid_4.1.0        tidyselect_1.1.1  glue_1.4.2        data.table_1.14.0 R6_2.5.0          fansi_0.5.0      
[36] ggplot2_3.3.4     purrr_0.3.4       magrittr_2.0.1    BBmisc_1.11       backports_1.2.1   scales_1.1.1      ellipsis_0.3.2   
[43] assertthat_0.2.1  colorspace_2.0-1  utf8_1.2.1        munsell_0.5.0     slam_0.1-48       tm_0.7-8          crayon_1.4.1

Under the setting given above, the following warning message appears when e.g. LDARep is executed:
1: In sprintf(...) : one argument not used by format 'Exporting objects to package env on master for mode: %s'

If you are running LDARep locally, however, the following message should appear:
Exporting objects to package env on master for mode: local

This is a warning resulting from parallelMap::parallelExport, specifically from the line
showInfoMessage("Exporting objects to package env on master for mode: %s", mode, collapse(objnames)).
There is only one conversion specification (%s) but two further arguments, which results in the warning.
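The warning is easy to reproduce with a plain sprintf call that has one conversion specification but two further arguments (a minimal illustration, not the package code; "some_object" stands in for collapse(objnames)):

sprintf("Exporting objects to package env on master for mode: %s", "local", "some_object")
# warns: one argument not used by format 'Exporting objects to package env on master for mode: %s'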

This will not be fixed in parallelMap, because its development has been retired, and the warning is merely unwanted behavior that does not strictly need to be corrected. Instead, the ldaPrototype package will replace parallelMap with the future package in the long run.

move parallelMap package from "Suggests" to "Imports" in DESCRIPTION file

{parallelMap} is called when running LDARep, a core function of the package. But because it is listed in "Suggests", it isn't installed by default when the package is installed. So, if someone calls install.packages("ldaPrototype") and doesn't have {parallelMap} already installed, running LDARep or a function that calls it will result in an error.

Error in loadNamespace(name) : there is no package called ‘parallelMap’

I'd recommend moving parallelMap to Imports.

FWIW, this shouldn't impact the JOSS review IMO. But it would make the package more useful. (My first call to LDAPrototype resulted in the above error.)
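Until the dependency is moved or parallelMap is replaced, an affected user can avoid the error by installing the suggested package manually:

install.packages("parallelMap")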

Paper Suggestions

In reference to the JOSS review, a few paper suggestions:

  • mallet is another package for estimating LDA that might be mentioned along with lda and topicmodels.
  • Regarding the line "A large part of the analysis is based on this model (LDA).": I don't know what 'the analysis' is referencing.
  • Regarding the line "Up to now, the so-called eye-balling method has been used in practice to select suitable results. From a set of models, subjective decisions are made to select the model that seems to fit the data best. This contradicts basically good scientific practice.": I think the more prominent technique has been selection by log-likelihood. I also don't think your characterization of good scientific practice is uncontroversial here: there are tons of decisions that are essentially subjective (including decisions about how to do measurement, what to investigate, or even whether to use LDA at all). It seems that someone who selects a model based on careful reading or on fit for a particular question of analysis would fall under bad scientific practice by your definition, which feels unfair. I'll note that many of the choices in your design are equally arbitrary (e.g. the similarity measure, the decision to essentially binarize a continuous measure, the thresholds, the way of finding the prototype). The advantage your package offers is a kind of arbitrary transparency: it provides a procedure that makes a choice, effectively tying the analyst's hands so that they can claim they didn't search over results (of course, in practice they could just search over the parameters of your function as well).

docs object expects all word frequencies to be 1 - transformation from dfm object (quanteda)

The docs object expects (for technical reasons) that all words occur with frequency 1. If words occur several times, they appear several times each with frequency 1.
In the quanteda package there are dfm objects that also allow values greater than 1. If you do your preprocessing in quanteda and want to use quanteda::dfm2lda to convert your object into the necessary structure, you need one more step to fulfill the requirements for the docs object. Just execute the following line:

docs = lapply(docs, function(x) rbind(rep(x[1,], x[2,]), 1))

This replicates words with multiple occurrences and protects you from the error message "all(sapply(docs, function(x) all(x[2, ] == 1))) is not TRUE" in LDARep and similar functions.
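After the conversion, the structure can be verified with the same check that the quoted assertion performs:

all(sapply(docs, function(x) all(x[2, ] == 1)))  # should return TRUE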

Failure with dev testthat

I see:

> test_check("ldaPrototype")
── Warning (test_LDABatch.R:146:3): is.LDABatch ────────────────────────────────
Parameter(s) num.iterations are duplicated. Take last one(s).

Killed

Can you please take a look? I'm planning to submit testthat to CRAN in about a month.

Error: (unknown) (@test_jaccardTopics.R#8): wrong sign in 'by' argument

Hi. When running revdep checks, ldaPrototype has produced the error below. I haven't looked at the code, so I don't know what N is, but it looks like N - 2 < x, and that's unexpected. Maybe you're able to see how this could happen.

...
  The following object is masked from 'package:stats':
  
      cutree
  
  > 
  > test_check("ldaPrototype")
  ── 1. Error: (unknown) (@test_jaccardTopics.R#8)  ──────────────────────────────
  wrong sign in 'by' argument
  Backtrace:
   1. ldaPrototype::jaccardTopics(mtopics, pm.backend = "socket")
   2. ldaPrototype:::jaccardTopics.parallel(...)
   3. base::lapply(...)
   4. ldaPrototype:::FUN(X[[i]], ...)
   6. base::seq.default(x, N - 2, max(ncpus, 2))
  
  ══ testthat results  ═══════════════════════════════════════════════════════════
  [ OK: 243 | SKIPPED: 0 | WARNINGS: 2 | FAILED: 1 ]
  1. Error: (unknown) (@test_jaccardTopics.R#8) 
  
  Error: testthat unit tests failed
  Execution halted
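For reference, the error in the backtrace can be reproduced with a plain seq() call in which the end value is smaller than the start value while the increment is positive, which matches the N - 2 < x reading above (a minimal illustration, not the package code):

seq(5, 2, by = 2)
# Error in seq.default(5, 2, by = 2) : wrong sign in 'by' argument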

Defaults and Documentation

Regarding the JOSS review, I'd recommend documenting the default parameters passed to lda.collapsed.gibbs.sampler in the help files for the functions that call it. This is particularly important for those which don't have defaults or have different defaults in the original package.

It is a stylistic choice, but I'd also give some consideration to removing the default for K. Users rarely change defaults, and I think one reason other packages don't offer a default for K is to signal that it is something the user really has to engage with.

Add example analysis to README or Vignette

Relating to JOSS review here

There is no example of how to use the software for an analysis problem. A great place to put this would be the README, showing basic usage. If you want to cover more ground than one would typically put in a README, a vignette is a good place. But without this, I'm not sure where to start and thus can't check functionality.
