agoldst / dfrtopics
An R package for exploring topic models of text
License: MIT License
Dear Andrew,
I am trying to update the package, but it's giving the following error.
devtools::install_github("agoldst/dfrtopics")
Downloading github repo agoldst/dfrtopics@master
Installing dfrtopics
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ
--no-save --no-restore CMD INSTALL
'/private/var/folders/08/74jhwt0s54gb6pfr3byyy1ww0000gn/T/RtmpyE0efn/devtoolsa4525bc153d3/agoldst-dfrtopics-2235503'
--library='/Library/Frameworks/R.framework/Versions/3.1/Resources/library'
--install-tests
Is there a way to get around it?
Regards,
Hamza
I am following the dfrtopics introduction (http://agoldst.github.io/dfrtopics/introduction.html) and running:
counts <- semi_join(counts, meta %>% select(id, pubdate) %>% filter(year(pubdate) != 1995), by="id")
This returns an error:
Error: 'id' column not found in lhs, cannot join
In the align branch I have written draft code for aligning topics from different models (with the same number of topics but possibly nonidentical vocabularies), inspired by this paper. But because I (a) know nothing about clustering and (b) apparently prefer hacking something together to researching correct implementations, the resulting code may or may not yield helpful alignments. I'm putting the experiment online in case anyone with an interest wants to try it out and see what they can find / whether they can diagnose problems. Install with
devtools::install_github("agoldst/dfrtopics", ref="align")
Create a series of models in the package's idiom:
m1 <- train_model(..., n_topics=40)
m2 <- train_model(..., n_topics=40)
...
Then:
dst <- model_distances(list(m1, m2, m3), n_words=40, g=JS_divergence)
clustering <- align_topics(dst, threshold=0.50)
A little more detail in the help files. Parameters to tune are n_words (number of top words from each topic to consider), g, the metric (Chuang et al. use cosine distance), and threshold, the distance after which clustering halts.
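For readers unfamiliar with the two metrics mentioned above, here is a minimal base-R sketch of Jensen-Shannon divergence and cosine distance between two topic-word distributions. The helper names js_div and cos_dist are hypothetical and written only for illustration; they are not the package's JS_divergence, which may handle vocabularies and normalization differently.

```r
# Jensen-Shannon divergence between two discrete distributions (log base 2,
# so the result lies in [0, 1]). Hypothetical helper, not the package's own.
js_div <- function(p, q) {
    p <- p / sum(p)
    q <- q / sum(q)
    m <- (p + q) / 2
    # Kullback-Leibler divergence, treating 0 * log(0) as 0
    kl <- function(a, b) sum(ifelse(a > 0, a * log2(a / b), 0))
    (kl(p, m) + kl(q, m)) / 2
}

# Cosine distance, the metric the text attributes to Chuang et al.
cos_dist <- function(p, q) {
    1 - sum(p * q) / sqrt(sum(p^2) * sum(q^2))
}

# Two made-up word distributions over the same four-word vocabulary
p <- c(0.5, 0.3, 0.2, 0)
q <- c(0.1, 0.2, 0.3, 0.4)

js_div(p, p)    # identical distributions: divergence is 0
js_div(p, q)    # somewhere between 0 and 1
cos_dist(p, q)
```

Both functions return small values for similar topics, so either could in principle be compared against a threshold; which one gives better alignments is an empirical question.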
The latest version of the R mallet package on github causes some problems with the train_model function of R/model.R. The main issue seems to be that trainer$model isn't defined in the latest github version of the mallet code, but it is on the CRAN version.
To duplicate:
Remove existing mallet package, and re-install from github:
remove.packages("mallet")
library(devtools)
install_github("mimno/RMallet/mallet")
Create some instances and then try to train a model:
m <- train_model(instance, n_topics=50, n_iters=300, metadata=df, threads=4)
...and then the error:
Apr 26, 2016 11:41:36 AM cc.mallet.topics.ParallelTopicModel <init>
INFO: Mallet LDA: 50 topics, 6 topic bits, 111111 topic mask
Error in trainer$model : no field, method or inner class called 'model'
I'm able to fix this by just using the version from CRAN:
remove.packages("mallet")
install.packages("mallet")
So this works for now, until the CRAN version gets updated to the latest github version.
Here is a dummied-up script to test what seems to be a bug with inference in dfrtopics:
options(java.parameters="-Xmx6g")
library(dfrtopics)
library(dplyr)
#First create some dummy data for repeatability. Read in Moby Dick from Gutenberg. Since readLines breaks at the newline char we'll treat each line as a new "text"
texts <- text_of_file <- readLines("http://www.gutenberg.org/files/2701/2701-0.txt")
#Now remove those pesky blanks (logical indexing stays correct even if there are no blank lines, unlike -which())
texts <- texts[texts != ""]
#Grab 2000 random items for training and put into dataframe with proper colnames and some dummied id labels
training_docs <- data_frame(id = paste("Train", 1:2000, sep="_"), text = sample(texts, 2000))
#Now grab another 100 that we'll pretend are new documents for inference later on
inference_docs <- data_frame(id = paste("Test", 1:100, sep="_"), text = sample(texts, 100))
#Make an instance list for the training docs (for the sake of this demo, no stoplist)
training_ilist <- make_instances(training_docs)
#Train a topic model
m <- train_model(training_ilist, n_topics=10, n_iters=100, seed=1966)
#Now write the model to disk so we can load it later. Also write out the instance list, we're going to need it.
write_mallet_model(m, "DEMO_MODEL", save_instances = TRUE)
#Before we can infer the topical makeup of new files, we need a compatible instance list (aka use-pipe-from in mallet)
#For some reason, load_mallet_model_directory does not load the instance file that we saved above as part of the write_mallet_model . . . I'm not sure why?
#Interestingly, we can build an inferencer from the model before reloading it using load_mallet_model_directory, but it does not work after loading. in other words: this works correctly
inf <- inferencer(m)
inf
#But once we reload the model from file, like this
m <- load_mallet_model_directory("DEMO_MODEL") #DEMO_MODEL = local path
#We can't create an inferencer
inf <- inferencer(m)
inf # returns NULL
#Hmm, that's weird. Imagine that we quit R and want to come back another day and load the model and do some inference on some new files. It looks like we cannot do that.
#But maybe there is another route. I saved the instance list, so perhaps I can read it in and then use it in conjunction with the compatible_instances(docs, instances) function
ilist <- read_instances("DEMO_MODEL/instances.mallet")
inference_ilist <- compatible_instances(inference_docs, ilist)
#Ok, so now we've got a loaded model from disk and a compatible instance list. I should be able to infer topics on new docs. . .
inferred_m <- infer_topics(m, inference_ilist) # Tada!
#But no. . . .
#Error in rJava::.jcall(m, "[D", "getSampledDistribution", inst, n_iterations, :
#RcallMethod: invalid object parameter
#According to the help file: m can be either a topic inferencer object from read_inferencer or inferencer or a mallet_model object. m is of the latter type:
class(m)
#[1] "mallet_model"
#So why the error?
#Let's try another route. rebuild the same model
m <- train_model(training_ilist, n_topics=10, n_iters=100, seed=1966)
m_inferencer <- inferencer(m)
#Save it to disk
write_inferencer(m_inferencer, "DEMO_MODEL/m_inferencer.mallet")
#Read the inference from the file
inf <- read_inferencer("DEMO_MODEL/m_inferencer.mallet")
test <- infer_topics(inf, inference_ilist)
#Ugh. same error again . . .
#Error in rJava::.jcall(m, "[D", "getSampledDistribution", inst, n_iterations, :
#RcallMethod: invalid object parameter
#What now?
In v0.2.2, I try to implement (the most basic version of) the posterior predictive checking for topic models described in Mimno and Blei, 2011. But verifying that the checking actually does what it ought to is...challenging for me. I'm opening this issue as a way for anyone who's interested (anyone?) to make suggestions or, better yet, to demonstrate that the mi_check and imi_check functions I've added either do or do not work correctly. These and associated functions have fairly detailed documentation in the package. The associated source is in R/mi.R, R/rmultinom_sparse.R, src/entropy.cpp, and src/multinom.cpp.
First, thank you for sharing all this code and writing such excellent walkthroughs, both here and in dfr-browser!
I have several mallet topic-state.gz files, generated before I started using your package, that I'm now trying to load with the package, in part to take advantage of export_browser_data(). But the load_from_mallet_state() function has been crashing around the point of saving a simple state.csv and loading it into memory.
Here's the error message I get when I call load_from_mallet_state(mstate, simplified_state=simple_state, instances_file=ilist) with absolute paths for mstate, simple_state, and ilist. (I found that bash-style filepath expansions, relative to ~, weren't being interpreted properly.)
Loading /{mypath}/{myfile}.csv to a big.matrix... Done.
Error in bigtabulate(x, ccols = ccols, breaks = breaks, table = FALSE, : REAL() can only be applied to a 'numeric', not a 'logical'
Any idea what could be causing the mis-type? I see that the simplify_state() function calls Python; the relevant version on my Mac OS system is 2.7.10. But I checked in Terminal that simplify_state.py produces reasonable-looking, integer-based output, and it does, so I'm thinking the problem must be somewhere further down the rabbit-hole that starts with read_state_bigmem().
Any help would be greatly appreciated, and please let me know what other information from me might be useful!
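On the side note about ~ paths above: tilde expansion is normally done by the shell, so a path containing ~ may not be understood once it is handed to an external program. A possible workaround, sketched here with base R only, is to expand the path before passing it on. The file path below is hypothetical, and whether this resolves the behavior described in this issue is an assumption, not something confirmed.

```r
# "~" is expanded by the shell, not by every program that receives a path.
# path.expand() and normalizePath() are both base R functions.
state_file <- "~/models/topic-state.gz"   # hypothetical path, for illustration

# Replace the leading "~" with the user's home directory
abs_state <- path.expand(state_file)
abs_state

# normalizePath() also resolves relative components; mustWork = FALSE
# avoids an error when the file does not exist (as in this sketch)
normalizePath(state_file, mustWork = FALSE)
```

The expanded value could then be supplied wherever an absolute path is needed.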
Hi,
I am encountering an error message after running the train_model function.
Error: 'data_frame_' is not an exported object from 'namespace:dplyr'
Also, many of the parameters of the mallet_model object appear as NULL (e.g., top_words:NULL).
I will appreciate any help you could provide.
Kind regards,
Jaime