contefranz / optop

Optimal topic identification from a pool of Latent Dirichlet Allocation models

R 77.58% C++ 22.25% C 0.17%
latent-dirichlet-allocation lda model-selection natural-language-processing nlp text-mining topic-modeling

optop's Introduction

Hi, I am Francesco! 👋

I am an Assistant Professor of Accounting Analytics and Data Science at the Department of Accounting at Bocconi University.

I am also a fellow at the Bocconi Institute for Data Science and Analytics.

I try to understand what companies (and managers) write in their reports using Natural Language Processing techniques.

optop's People

Contributors

contefranz, mattia-

optop's Issues

Check compilation issues under MacOS

Using devtools::load_all(".") raises the following error:

##> Error: Could not find tools necessary to compile a package
##> Call `pkgbuild::check_build_tools(debug = TRUE)` to diagnose the problem.

I solved it by invoking options(buildtools.check = function(action) TRUE) before loading the functions or building the package from scratch.

To do: check whether this is a one-time issue or whether it recurs at every startup.

R session used:

sessionInfo()
#> R version 4.0.4 (2021-02-15)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Catalina 10.15.7
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.27     assertthat_0.2.1  magrittr_2.0.1    reprex_1.0.0     
#>  [5] evaluate_0.14     highr_0.8         stringi_1.5.3     rlang_0.4.10     
#>  [9] cli_2.3.1         rstudioapi_0.13   fs_1.5.0          rmarkdown_2.7    
#> [13] tools_4.0.4       stringr_1.4.0     glue_1.4.2        xfun_0.22        
#> [17] yaml_2.2.1        compiler_4.0.4    htmltools_0.5.1.1 knitr_1.31

Created on 2021-03-27 by the reprex package (v1.0.0)

UPDATE

It appears to be a bug in RStudio, but folks there do not seem to be investing a lot of effort. This is a good issue to monitor here.

A temporary workaround that seems to work most of the time, at least for me, is to run devtools::load_all(".") first to trigger compilation and load all the functions. After that, the standard RStudio shortcuts for building and checking seem to work.

Remove comments referring to old code

After all the conversions are done and tested, remove all the comments referring to old chunks of code that are no longer applicable. This issue involves most of the functions.

Move topic_match() to internal

The function topic_match() is a utility function that returns a list of two data.tables containing the informative and uninformative components. These are used only by the agg_document_stability() function.

I have a couple of suggestions:

  1. Move topic_match() to the utility function section and do not export it to the user. The designated functions will call it internally. This is to avoid confusion since there is really nothing else to do once the informative and uninformative components are computed.
  2. Convert topic_match() to C++. Even if this is "just" a utility function, its cost can't be neglected.

R CMD check for CRAN submission

Different machines report different WARNINGs and NOTEs after running R CMD check --as-cran.

@mattia- Can you check whether you have missing packages that could conflict with the check?
Here is the R session I am using, for comparison.

sessionInfo()
#> R version 4.0.4 (2021-02-15)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Catalina 10.15.7
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.27     assertthat_0.2.1  magrittr_2.0.1    reprex_1.0.0     
#>  [5] evaluate_0.14     highr_0.8         stringi_1.5.3     rlang_0.4.10     
#>  [9] cli_2.3.1         rstudioapi_0.13   fs_1.5.0          rmarkdown_2.7    
#> [13] tools_4.0.4       stringr_1.4.0     glue_1.4.2        xfun_0.22        
#> [17] yaml_2.2.1        compiler_4.0.4    htmltools_0.5.1.1 knitr_1.31

Created on 2021-03-27 by the reprex package (v1.0.0)

Final C++ conversions

Port the inner computations of topic_stability(), agg_topic_stability() and agg_document_stability() to C++.

Loop over remaining documents in optimal_topic() goes out of bounds

When optimal_topic() finds no perfect match between weighted_dfm and lda_models because the LDA was not able to estimate the model over some documents, the function removes the corresponding entries from weighted_dfm. We thought we were updating the same object, but apparently we are not. For this reason, when optimal_topic_core() loops over the documents in a given element of lda_models, it goes out of bounds.

Could it be that we do not carry the information about the removed documents from R to C++?
This needs to be solved because it blocks the release of v1.0.0, so I am labeling it as a bug.
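
To make the suspected failure mode concrete, here is a minimal sketch in plain C++ (no Rcpp) of the kind of guard that would surface the mismatch at the boundary instead of reading past the end of the matrix. It is not optop's code: the Dfm struct and the functions keep_documents() and total_weight() are hypothetical names invented for illustration. The sketch assumes the filtering step returns a new object rather than modifying its input, which is one possible reason the caller could end up looping with a stale document count.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a document-feature matrix: one row per document.
struct Dfm {
    std::vector<std::string> doc_ids;       // document names, one per row
    std::vector<std::vector<double>> rows;  // one weight vector per document
};

// Returns a filtered copy that keeps only the documents listed in `keep`.
// The input is taken by const reference and is NOT modified, so the caller
// must use the returned object from here on.
Dfm keep_documents(const Dfm& dfm, const std::unordered_set<std::string>& keep) {
    assert(dfm.doc_ids.size() == dfm.rows.size());
    Dfm out;
    for (std::size_t i = 0; i < dfm.doc_ids.size(); ++i) {
        if (keep.count(dfm.doc_ids[i]) > 0) {
            out.doc_ids.push_back(dfm.doc_ids[i]);
            out.rows.push_back(dfm.rows[i]);
        }
    }
    return out;
}

// Sums all weights over the documents the model claims to cover.
// Throws if the model and the matrix disagree on the number of documents,
// rather than indexing rows that may not exist.
double total_weight(const Dfm& dfm, std::size_t n_model_docs) {
    if (n_model_docs != dfm.rows.size()) {
        throw std::invalid_argument(
            "model and dfm cover a different number of documents");
    }
    double total = 0.0;
    for (std::size_t d = 0; d < n_model_docs; ++d) {
        for (double w : dfm.rows[d]) {
            total += w;
        }
    }
    return total;
}
```

With a check like this, passing the unfiltered matrix together with the document count of a model estimated on fewer documents fails immediately with a clear message, which would help tell apart "the removal never reached C++" from an indexing error inside the loop.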
