agoldst / dfrtopics

An R package for exploring topic models of text

License: MIT License

R 84.89% Python 3.60% HTML 4.20% C++ 4.71% Shell 0.01% CSS 2.59%

dfrtopics's People

Contributors

agoldst

dfrtopics's Issues

Problem with updating package

Dear Andrew,

I am trying to update the package, but it gives the following error.

devtools::install_github("agoldst/dfrtopics")
Downloading github repo agoldst/dfrtopics@master
Installing dfrtopics
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ
--no-save --no-restore CMD INSTALL
'/private/var/folders/08/74jhwt0s54gb6pfr3byyy1ww0000gn/T/RtmpyE0efn/devtoolsa4525bc153d3/agoldst-dfrtopics-2235503'
--library='/Library/Frameworks/R.framework/Versions/3.1/Resources/library'
--install-tests

  • installing source package ‘dfrtopics’ ...
    ** libs
    xcrun: error: invalid active developer path (/Library/Developer/CommandLineTools), missing xcrun at: /Library/Developer/CommandLineTools/usr/bin/xcrun
    ERROR: compilation failed for package ‘dfrtopics’
  • removing ‘/Library/Frameworks/R.framework/Versions/3.1/Resources/library/dfrtopics’
  • restoring previous ‘/Library/Frameworks/R.framework/Versions/3.1/Resources/library/dfrtopics’
    Error: Command failed (1)

How can I get around this?

Regards,

Hamza
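
This xcrun error usually means that the macOS Command Line Tools are missing or were invalidated by a system upgrade, so the package's compiled sources cannot be built. A minimal sketch of a check from within R follows. `clt_status` is a hypothetical helper written for illustration (it is not part of devtools or dfrtopics), and the suggested `xcode-select --install` step assumes a standard macOS setup.

```r
# Path copied from the error message above.
clt_xcrun <- "/Library/Developer/CommandLineTools/usr/bin/xcrun"

# Hypothetical helper: report whether the Command Line Tools copy of xcrun exists.
clt_status <- function(path = clt_xcrun) {
  if (file.exists(path)) {
    "Command Line Tools found; retry the install."
  } else {
    "Command Line Tools missing; run `xcode-select --install` in Terminal, then retry."
  }
}

message(clt_status())
```

After the tools are reinstalled, rerunning devtools::install_github("agoldst/dfrtopics") should get past the compilation step, assuming nothing else is wrong with the toolchain.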

Topics share over time

Hello! I have Twitter data and performed topic modeling with Mallet. I visualized topics over time with topic_series (but changed years to days) and got this:
[image]

Can I somehow limit the y-axis? All the plots sit in the bottom part of the y-axis, and my main goal is to visualize trends of topics over time.
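
One possible approach, sketched under assumptions: if the plot is a ggplot2 object built from a long data frame of topic proportions, the shared y-axis can be zoomed with coord_cartesian, or each topic panel can get its own scale with facet_wrap(scales = "free_y"). The data frame below is made up for illustration, and its column names (topic, pubdate, weight) are an assumption about the shape of the topic_series output; check them against your own data.

```r
library(ggplot2)

# Made-up stand-in for a topic series: one row per topic per day.
set.seed(1)
series <- data.frame(
  topic   = rep(1:3, each = 30),
  pubdate = rep(seq(as.Date("2016-01-01"), by = "day", length.out = 30), times = 3),
  weight  = runif(90, min = 0, max = 0.05)
)

# Option 1: zoom the shared y-axis to the observed range without dropping data.
p_zoom <- ggplot(series, aes(pubdate, weight)) +
  geom_line() +
  facet_wrap(~ topic) +
  coord_cartesian(ylim = c(0, max(series$weight)))

# Option 2: let every topic panel choose its own y-axis range.
p_free <- ggplot(series, aes(pubdate, weight)) +
  geom_line() +
  facet_wrap(~ topic, scales = "free_y")
```

Printing p_zoom or p_free draws the plot. Free scales make trends within each topic easier to see, at the cost of making heights incomparable across panels.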

misaligned? fix my bugs for me kthxbye

In the align branch I have written draft code for aligning topics from different models (with the same number of topics but possibly nonidentical vocabularies), inspired by this paper. But because I (a) know nothing about clustering and (b) apparently prefer hacking something together to researching correct implementations, the resulting code may or may not yield helpful alignments. I'm putting the experiment online in case anyone with an interest wants to try it out and see what they can find / whether they can diagnose problems. Install with

devtools::install_github("agoldst/dfrtopics", ref="align")

Create a series of models in the package's idiom:

m1 <- train_model(..., n_topics=40)
m2 <- train_model(..., n_topics=40)
...

Then:

dst <- model_distances(list(m1, m2, m3), n_words=40, g=JS_divergence)
clustering <- align_topics(dst, threshold=0.50)

There is a little more detail in the help files. The parameters to tune are n_words (the number of top words from each topic to consider), g (the metric; Chuang et al. use cosine distance), and threshold (the distance after which clustering halts).
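
To make the role of the metric concrete, here is a minimal sketch of a Jensen-Shannon divergence between two topics represented as word-probability vectors over a shared vocabulary. `kl_bits` and `js_sketch` are illustrative names, not the package's JS_divergence, and the base-2 logarithm is an assumption chosen so that the result lies between 0 and 1.

```r
# Kullback-Leibler divergence in bits; terms with p == 0 contribute nothing.
kl_bits <- function(p, q) {
  keep <- p > 0
  sum(p[keep] * log2(p[keep] / q[keep]))
}

# Jensen-Shannon divergence between two weight vectors over the same vocabulary.
js_sketch <- function(p, q) {
  p <- p / sum(p)      # normalize to probabilities
  q <- q / sum(q)
  m <- (p + q) / 2     # midpoint distribution
  (kl_bits(p, m) + kl_bits(q, m)) / 2
}

# Two toy topics over a four-word vocabulary.
t1 <- c(whale = 0.7, ship = 0.3, tax = 0,   vote = 0)
t2 <- c(whale = 0,   ship = 0,   tax = 0.5, vote = 0.5)

js_sketch(t1, t1)  # identical topics give 0
js_sketch(t1, t2)  # topics with no shared words give the maximum, 1
```

Under this scaling, a threshold is a cutoff on a 0-to-1 distance. Whether the package's own metric uses the same scaling is not stated here, so check the help files before comparing values.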

train_model works with mallet from CRAN, but not mallet from github

The latest version of the R mallet package on github causes some problems with the train_model function of R/model.R. The main issue seems to be that trainer$model isn't defined in the latest github version of the mallet code, but it is on the CRAN version.

To duplicate:

Remove existing mallet package, and re-install from github:

remove.packages("mallet")
library(devtools)
install_github("mimno/RMallet/mallet")

Create some instances and then try to train a model:

m <- train_model(instance, n_topics=50, n_iters=300, metadata=df, threads=4)

...and then the error:

Apr 26, 2016 11:41:36 AM cc.mallet.topics.ParallelTopicModel <init>
INFO: Mallet LDA: 50 topics, 6 topic bits, 111111 topic mask
Error in trainer$model : no field, method or inner class called 'model' 

I'm able to fix this by just using the version from CRAN:

remove.packages("mallet")
install.packages("mallet")

So this works for now, until the CRAN version gets updated to the latest github version.
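
A quick way to see which build of mallet is installed, sketched under assumptions: CRAN installs record `Repository: CRAN` in the package DESCRIPTION, and installs made through devtools/remotes record a `RemoteType` field. `install_source` is a hypothetical helper, not part of any of the packages discussed here.

```r
# Hypothetical helper: guess where an installed package came from, using
# DESCRIPTION fields written at install time.
install_source <- function(pkg) {
  desc <- suppressWarnings(utils::packageDescription(pkg))
  if (!is.list(desc)) return("not installed")   # packageDescription gave NA
  if (identical(desc$Repository, "CRAN")) return("CRAN")
  if (identical(desc$RemoteType, "github")) return("github")
  "other"
}

install_source("mallet")
```

If this reports "github" and train_model fails on trainer$model, reinstalling from CRAN as described above is the workaround given in this report.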

Problem inferring topics on new docs using a saved model

Here is a dummied up script to test what seems to be a bug with inference in dfrtopics

options(java.parameters="-Xmx6g")
library(dfrtopics)
library(dplyr)

#First create some dummy data for repeatability. Read in Moby Dick from Gutenberg. Since readLines breaks at the newline char, we'll treat each line as a new "text"

texts <- text_of_file <- readLines("http://www.gutenberg.org/files/2701/2701-0.txt")

#Now remove those pesky blanks

texts <- texts[texts != ""]

#Grab 2000 random items for training and put into dataframe with proper colnames and some dummied id labels

training_docs <- data_frame(id = paste("Train", 1:2000, sep="_"), text = sample(texts, 2000))

#Now grab another 100 that we'll pretend are new documents for inference later on

inference_docs <- data_frame(id = paste("Test", 1:100, sep="_"), text = sample(texts, 100))

#Make an instance list for the training docs (for the sake of this demo, no stoplist)

training_ilist <- make_instances(training_docs)

#Train a topic model

m <- train_model(training_ilist, n_topics=10, n_iters=100, seed=1966)

#Now write the model to disk so we can load it later. Also write out the instance list, we're going to need it.

write_mallet_model(m, "DEMO_MODEL", save_instances = TRUE)

#Before we can infer the topical makeup of new files, we need a compatible instance list (aka use-pipe-from in mallet)

#For some reason, load_mallet_model_directory does not load the instance file that we saved above as part of the write_mallet_model . . . I'm not sure why?

#Interestingly, we can build an inferencer from the model before reloading it using load_mallet_model_directory, but it does not work after loading. in other words: this works correctly

inf <- inferencer(m)
inf

#But once we reload the model from file, like this

m <- load_mallet_model_directory("DEMO_MODEL") #DEMO_MODEL = local path

#We can't create an inferencer
inf <- inferencer(m)
inf # returns NULL

#Hmm, that's weird. Imagine that we quit R and want to come back another day and load the model and do some inference on some new files. It looks like we cannot do that.

#But maybe there is another route. I saved the instance list, so perhaps I can read it in and then use it in conjunction with the compatible_instances(docs, instances) function

ilist <- read_instances("DEMO_MODEL/instances.mallet")
inference_ilist <- compatible_instances(inference_docs, ilist)

#Ok, so now we've got a loaded model from disk and a compatible instance list. I should be able to infer topics on new docs. . .

inferred_m <- infer_topics(m, inference_ilist) # Tada!

#But no. . . .

#Error in rJava::.jcall(m, "[D", "getSampledDistribution", inst, n_iterations, :
#RcallMethod: invalid object parameter

#According to the help file: m can be either a topic inferencer object from read_inferencer or inferencer or a mallet_model object. m is of the latter type:

class(m)
[1] "mallet_model"

#So why the error?

#Let's try another route. rebuild the same model

m <- train_model(training_ilist, n_topics=10, n_iters=100, seed=1966)
m_inferencer <- inferencer(m)

#Save it to disk

write_inferencer(m_inferencer, "DEMO_MODEL/m_inferencer.mallet")

#Read the inferencer from the file

inf <- read_inferencer("DEMO_MODEL/m_inferencer.mallet")
test <- infer_topics(inf, inference_ilist)

#Ugh. same error again . . .
#Error in rJava::.jcall(m, "[D", "getSampledDistribution", inst, n_iterations, :
#RcallMethod: invalid object parameter

#What now?

How do I check my check?

In v0.2.2, I try to implement (the most basic version of) the posterior predictive checking for topic models described in Mimno and Blei, 2011. But verifying that the checking actually does what it ought to is...challenging for me. I'm opening this issue as a way for anyone who's interested (anyone?) to make suggestions or, better yet, to demonstrate that the mi_check and imi_check functions I've added either do or do not work correctly. These and associated functions have fairly detailed documentation in the package. The associated source is in R/mi.R, R/rmultinom_sparse.R, src/entropy.cpp, and src/multinom.cpp.
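
One basic sanity check, sketched here independently of the package code: the mutual information of a word-by-document count table should be near zero when words and documents are independent, and clearly positive when they are not. `mi_table` is a hypothetical helper written for illustration; it is not mi_check or imi_check, and it does not implement the full procedure from Mimno and Blei.

```r
# Mutual information (in bits) of a two-way count table.
mi_table <- function(counts) {
  joint <- counts / sum(counts)                       # joint probabilities
  expected <- outer(rowSums(joint), colSums(joint))   # product of marginals
  keep <- joint > 0                                   # skip empty cells
  sum(joint[keep] * log2(joint[keep] / expected[keep]))
}

# Independent case: every document has the same word proportions.
indep <- outer(c(2, 1, 1), c(10, 20))
# Dependent case: each word occurs in only one document.
dep <- diag(c(5, 5))

mi_table(indep)  # approximately 0
mi_table(dep)    # 1 bit
```

Feeding the package functions small tables like these, where the answer is known in advance, is one way to test whether they behave as expected.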

error with load_from_mallet_state

First, thank you for sharing all this code and writing such excellent walkthroughs, both here and in dfr-browser!

I have several mallet topic-state.gz files, generated before I started using your package, that I'm trying now to load with the package, in part to take advantage of export_browser_data(). But the load_from_mallet_state() function has been crashing around the point of saving a simple state.csv and loading it into memory.

Here's the error message I get when I call load_from_mallet_state(mstate, simplified_state=simple_state, instances_file=ilist) with absolute paths for mstate, simple_state, and ilist. (I found that bash-style filepath expansions, relative to ~, weren't being interpreted properly.)

Loading /{mypath}/{myfile}.csv to a big.matrix...
Done.
Error in bigtabulate(x, ccols = ccols, breaks = breaks, table = FALSE,  : 
  REAL() can only be applied to a 'numeric', not a 'logical'

Any idea what could be causing the mis-type? I see that the simplify_state() function calls Python; the relevant version on my Mac OS system is 2.7.10. But I checked in Terminal that simplify_state.py produces reasonable-looking, integer-based output, and it does, so I'm thinking the problem must be somewhere further down the rabbit-hole that starts with read_state_bigmem().

Any help would be greatly appreciated, and please let me know what other information from me might be useful!
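
On the filepath point: paths that begin with ~ can be expanded in R before they are passed on, using base R's path.expand. A minimal sketch follows; the file names are placeholders, not the real ones from this report.

```r
# Placeholder paths; substitute the real locations.
paths <- c(
  mstate       = "~/models/topic-state.gz",
  simple_state = "~/models/state.csv",
  ilist        = "~/models/instances.mallet"
)

# path.expand() replaces a leading ~ with the user's home directory.
abs_paths <- vapply(paths, path.expand, character(1))

# The expanded paths can then be passed along, as in the call above:
# load_from_mallet_state(abs_paths[["mstate"]],
#                        simplified_state = abs_paths[["simple_state"]],
#                        instances_file = abs_paths[["ilist"]])
```

This addresses only the path expansion, not the bigtabulate type error.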

Error: 'data_frame_' is not an exported object from 'namespace:dplyr'

Hi,

I am encountering an error message after running the train_model function.

Error: 'data_frame_' is not an exported object from 'namespace:dplyr'

Also, many of the parameters of the mallet_model object appear as NULL (e.g., top_words:NULL).

I would appreciate any help you could provide.

Kind regards,

Jaime
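
The message suggests that the installed dplyr does not export data_frame_, while the installed dfrtopics still calls it. A minimal sketch for confirming such a mismatch follows; `exports_name` is a hypothetical helper, not part of dplyr or dfrtopics.

```r
# Hypothetical helper: does an installed package export a given name?
exports_name <- function(pkg, name) {
  requireNamespace(pkg, quietly = TRUE) && name %in% getNamespaceExports(pkg)
}

exports_name("stats", "median")          # a name that base R's stats exports
# exports_name("dplyr", "data_frame_")   # FALSE would confirm the mismatch
```

If the dplyr check returns FALSE, the usual options are to reinstall the latest dfrtopics from GitHub or to use an older dplyr. Which of these applies here is not confirmed by the report.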
