bmschmidt / wordVectors

An R package for creating and exploring word2vec and other word embedding models

License: Other

R 56.56% C 43.44%

Word Vectors

An R package for building and exploring word embedding models.

Description

This package does three major things to make it easier to work with word2vec and other vectorspace models of language.

Trains word2vec models using an extended version of Jian Li's word2vec code; reads and writes the binary word2vec format so that you can import pre-trained models such as Google's; and provides tools for reading only part of a model (rows or columns) so you can explore a model in memory-limited situations.
Creates a new VectorSpaceModel class in R that gives a better syntax for exploring a word2vec or GloVe model than native matrix methods. For example, instead of writing

model[rownames(model)=="king",],

you can write

model[["king"]],

and instead of writing

vectors %>% closest_to(vectors[rownames(vectors)=="king",] - vectors[rownames(vectors)=="man",] + vectors[rownames(vectors)=="woman",]) (whew!),

you can write

vectors %>% closest_to(~"king" - "man" + "woman").
Implements several basic matrix operations that are useful in exploring word embedding models including cosine similarity, nearest neighbor, and vector projection with some caching that makes them much faster than the simplest implementations.

Quick start

For a step-by-step interactive demo that includes installation and training a model on 77 historical cookbooks from Michigan State University, see the introductory vignette.

Credit

This includes an altered version of Tomas Mikolov's original C code for word2vec; those wrappers were originally written by Jian Li, and I've only tweaked them a little. Several other users have improved that code since I posted it here.

Right now, it does not (I don't think) install under Windows 8. Help appreciated on that thread. OS X, Windows 7, Windows 10, and Linux install perfectly well, with one or two exceptions.

It's not extremely fast, but once the data is loaded, most operations run quickly enough for exploratory data analysis (under a second on my laptop).

For high-performance analysis of models, C or python's numpy/gensim will likely be better than this package, in part because R doesn't have support for single-precision floats. The goal of this package is to facilitate clear code and exploratory data analysis of models.

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Creating text vectors.

One portion of this is an expanded version of the code from Jian Li's word2vec package with a few additional parameters enabled as the function train_word2vec.

The input must still be in a single file and pre-tokenized, but it uses the existing word2vec C code. For online data processing, I like the gensim python implementation, but I don't plan to link that to R.

In RStudio I've noticed that training can appear to hang, but if you check your processor usage you'll see it is still running. Try it on smaller portions first, and then let it take time: the training function can take hours for tens of thousands of books.
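As a rough illustration of what "a single file, pre-tokenized" can mean, here is a base-R sketch that lowercases a folder of plain-text files, replaces runs of non-alphanumeric characters with a space, and writes everything to one output file. The helper name `combine_corpus` is made up for this example; it is not the package's own preparation code, and a real corpus may need a more careful tokenizer.

```r
# Hypothetical helper (not part of wordVectors): merge a directory of
# plain-text files into one lowercased, space-separated file,
# writing one line per input file.
combine_corpus <- function(origin, destination) {
  files <- list.files(origin, full.names = TRUE, recursive = TRUE)
  out <- file(destination, open = "w", encoding = "UTF-8")
  on.exit(close(out))
  for (f in files) {
    text <- readLines(f, warn = FALSE, encoding = "UTF-8")
    text <- tolower(paste(text, collapse = " "))
    # Replace runs of anything that is not a letter or digit with one space.
    text <- gsub("[^[:alnum:]]+", " ", text)
    writeLines(trimws(text), out)
  }
  invisible(destination)
}

# Small demonstration on two temporary files.
corpus_dir <- tempfile()
dir.create(corpus_dir)
writeLines(c("Take two Eggs,", "beat well."), file.path(corpus_dir, "a.txt"))
writeLines("Add SUGAR!", file.path(corpus_dir, "b.txt"))
combined <- tempfile(fileext = ".txt")
combine_corpus(corpus_dir, combined)
readLines(combined)
```

The resulting file is the kind of input that could then be handed to train_word2vec.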

VectorSpaceModel object

The package loads the word2vec binary format with the function read.vectors into a new object called a "VectorSpaceModel" object. It's a light subclass of the standard R matrix object, so anything you can do with matrices, you can do with VectorSpaceModel objects.
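For readers curious about what the binary format looks like on disk, the following is a minimal base-R sketch of a reader. It assumes the layout written by the original word2vec C tool as I understand it: an ASCII header with the vocabulary size and dimension count, then for each word the word itself, a space, and the vector as little-endian 4-byte floats followed by a newline. The function names are invented for the example, and this is not the package's read.vectors implementation.

```r
# Read bytes up to (and consuming) stop_byte, ignoring stray newlines,
# and return them as a string.
read_token <- function(con, stop_byte) {
  bytes <- raw(0)
  repeat {
    b <- readBin(con, what = "raw", n = 1L)
    if (length(b) == 0L || b == stop_byte) break
    if (b != as.raw(0x0a)) bytes <- c(bytes, b)
  }
  rawToChar(bytes)
}

# Sketch of a reader for the assumed word2vec binary layout.
read_w2v_binary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  n_words <- as.integer(read_token(con, as.raw(0x20)))  # up to the space
  n_dims  <- as.integer(read_token(con, as.raw(0x0a)))  # up to the newline
  m <- matrix(0, nrow = n_words, ncol = n_dims)
  words <- character(n_words)
  for (i in seq_len(n_words)) {
    words[i] <- read_token(con, as.raw(0x20))
    # R has no single-precision type, so each 4-byte float becomes a double.
    m[i, ] <- readBin(con, what = "numeric", n = n_dims,
                      size = 4L, endian = "little")
  }
  rownames(m) <- words
  m
}
```

Because the words are stored next to their vectors, a reader like this could stop after the first few thousand rows, which is the idea behind loading only part of a model.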

It has a few convenience functions as well.

Faster Access to text vectors

The rownames of a VectorSpaceModel object are presumed to be tokens in a vector space model and therefore semantically useful. The classic word2vec demonstration is that vector('king') - vector('man') + vector('woman') =~ vector('queen'). With a standard matrix, the vector on the right-hand side of the equation would be described as

vector_set[rownames(vector_set)=="king",] - vector_set[rownames(vector_set)=="man",] + vector_set[rownames(vector_set)=="woman",]

In this package, you can simply access it by using the double-bracket operator:

vector_set[["king"]] - vector_set[["man"]] + vector_set[["woman"]]

(And in the context of the custom functions, as a formula like ~"king" - "man" + "woman": see below).

Since frequently an average of two vectors provides a better indication, multiple words can be collapsed into a single vector by specifying multiple labels. For example, this may provide a slightly better gender vector:

vector_set[["king"]] - vector_set[[c("man","men")]] + vector_set[[c("woman","women")]]

Sometimes you want to subset without averaging. You can do this by passing the argument average=FALSE to the subset. This is particularly useful for comparing slices of the matrix to itself in similarity operations.

cosineSimilarity(vector_set[[c("man","men","king"),average=FALSE]], vector_set[[c("woman","women","queen"),average=FALSE]])
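What averaging means here can be sketched with an ordinary matrix. The `lookup` helper below is hypothetical and only mimics the behaviour described above: it selects rows by name and, unless average is FALSE, collapses them into a single mean vector.

```r
# Toy 2-dimensional "model" as a plain matrix with words as row names.
toy <- matrix(c(1, 0,
                3, 2,
                0, 4),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("man", "men", "woman"), NULL))

# Hypothetical helper (not the package's `[[` method).
lookup <- function(m, words, average = TRUE) {
  rows <- m[rownames(m) %in% words, , drop = FALSE]
  if (average) colMeans(rows) else rows
}

averaged <- lookup(toy, c("man", "men"))                   # one vector
separate <- lookup(toy, c("man", "men"), average = FALSE)  # one row per word
```

Here `averaged` is the element-wise mean of the two rows, while `separate` keeps both rows so they can be compared individually.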

A few native functions defined on the VectorSpaceModel object.

The native show method just prints the dimensions. The native plot method reduces the vectors to two dimensions with t-SNE (the T-SNE package must be installed for this to work), since t-SNE is a nice way to reduce the dimensionality of vectors; alternatively, you can pass method='pca' to array a full set or subset by the first two principal components.
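To give a sense of the principal-components option, here is a small sketch that uses base R's prcomp on a plain matrix of made-up vectors and saves the resulting two-dimensional coordinates. It is an assumption-laden stand-in, not the package's plot method, and it does not use t-SNE.

```r
# Made-up vectors: 5 words in 10 dimensions.
set.seed(1)
toy <- matrix(rnorm(5 * 10), nrow = 5,
              dimnames = list(c("bread", "butter", "salt", "sugar", "flour"),
                              NULL))

pca <- prcomp(toy)        # centres the columns by default
coords <- pca$x[, 1:2]    # one row per word, columns PC1 and PC2

# Keep the coordinates in a data frame and write them to a CSV file.
coords_df <- data.frame(word = rownames(coords), coords, row.names = NULL)
coords_file <- file.path(tempdir(), "word_coords.csv")
write.csv(coords_df, coords_file, row.names = FALSE)
```

Calling `plot(coords, type = "n")` followed by `text(coords, labels = rownames(coords))` would draw the words at those positions.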

Useful matrix operations

One challenge of vector-space models of texts is that it takes some basic matrix multiplication functions to make them dance around in an entertaining way.

This package bundles the ones I think are the most useful. Each takes a VectorSpaceModel as its first argument. Sometimes, it's appropriate for the VSM to be your entire data set; other times, it's sensible to limit it to just one or a few vectors. Where appropriate, the functions can also take vectors or matrices as inputs.

cosineSimilarity(VSM_1,VSM_2) calculates the cosine similarity of every vector in one vector space model to every vector in another. This is n^2 complexity. With a vocabulary size of 20,000 or so, it can be reasonable to compare an entire set to itself; or you can compare a larger set to a smaller one to search for particular terms of interest.
cosineDistance(VSM_1,VSM_2) is the complement of cosineSimilarity (one minus the similarity). It's not really a distance metric, but can be used as one for clustering and the like.
closest_to(VSM,vector,n) wraps a particularly common use case for cosineSimilarity: finding the top n terms in a VectorSpaceModel closest to the given vector.
project(VSM,vector) takes a VectorSpaceModel and returns the portion parallel to the vector vector.
reject(VSM,vector) is the inverse of project; it takes a VectorSpaceModel and returns the portion orthogonal to the vector vector. This makes it possible, for example, to collapse a vector space by removing certain distinctions of meaning.
magnitudes calculates the magnitude of each element in a VSM. This is useful in many operations.
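The arithmetic behind the similarity functions listed above can be sketched with plain matrices. The names `cosine_sim` and `nearest` are invented for this example and the code is only an illustration of the idea, not the package's (cached) implementation.

```r
# Cosine similarity between every row of x and every row of y.
cosine_sim <- function(x, y) {
  x_unit <- x / sqrt(rowSums(x^2))   # scale each row to length 1
  y_unit <- y / sqrt(rowSums(y^2))
  x_unit %*% t(y_unit)               # dot products of unit vectors
}

# Hypothetical helper: the n rows of m most similar to a single vector v.
nearest <- function(m, v, n = 3) {
  sims <- cosine_sim(m, matrix(v, nrow = 1))[, 1]
  head(sort(sims, decreasing = TRUE), n)
}

toy <- matrix(c(1, 0,
                1, 1,
                0, 2),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("east", "northeast", "north"), NULL))

sims <- cosine_sim(toy, toy)   # 3 x 3 matrix with 1s on the diagonal
dists <- 1 - sims              # the corresponding cosine "distance"
top2 <- nearest(toy, c(0, 1), n = 2)
```

In this toy example `top2` lists "north" first and "northeast" second, since the query vector points straight north.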

All of these functions place the VSM object as the first argument. This makes it easy to chain together operations using the magrittr package. For example, beginning with a single vector set one could find the nearest words in a set to a version of the vector for "bank" that has been decomposed to remove any semantic similarity to the banking sector.

library(magrittr)
not_that_kind_of_bank = chronam_vectors[["bank"]] %>%
      reject(chronam_vectors[["cashier"]]) %>% 
      reject(chronam_vectors[["depositors"]]) %>%   
      reject(chronam_vectors[["check"]])
chronam_vectors %>% closest_to(not_that_kind_of_bank)
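The rejection step in that chain is ordinary vector arithmetic. Below is a base-R sketch for single vectors with illustrative function names and made-up two-dimensional values; it shows the calculation only and is not the package's code.

```r
# Component of v parallel to `onto`.
project_onto <- function(v, onto) {
  onto * sum(v * onto) / sum(onto * onto)
}

# Component of v orthogonal to `onto`: what is left after removing the projection.
reject_from <- function(v, onto) {
  v - project_onto(v, onto)
}

bank    <- c(3, 4)   # made-up toy vectors
cashier <- c(1, 0)

leftover <- reject_from(bank, cashier)
overlap  <- sum(leftover * cashier)   # dot product with the rejected direction
```

Here `leftover` is c(0, 4) and `overlap` is 0, meaning nothing of the "cashier" direction remains. Note that when several rejections are chained and the rejected vectors are not orthogonal to each other, a later step can reintroduce a small component along an earlier one.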

These functions also allow an additional layer of syntactic sugar when working with word vectors: if you're working entirely with a single model, you can use a formula interface so that you don't have to keep referring to the model by name, which reduces typing and increases clarity.

vectors %>% closest_to(~ "king" - "man" + "woman")

Quick start

Install the wordVectors package.

One of the major hurdles to running word2vec for ordinary people is that it requires compiling a C program. For many people, it may be easier to install it in R.

If you haven't already, install R and then install RStudio.
Open R, and get a command-line prompt (the thing with a > on the left hand side.) This is where you'll be copy-pasting commands.
Install (if you don't already have it) the package devtools by pasting the following
```
install.packages("devtools")
```
Install the latest version of this package from Github by pasting in the following.
```
devtools::install_github("bmschmidt/wordVectors")
```
Windows users may need to install "Rtools" as well: if so, a message to this effect should appear in red on the screen. This may cycle through a very large number of warnings: so long as it says "warning" and not "error", you're probably OK.

Train a model.

For instructions on training, see the introductory vignette

Explore an existing model.

For instructions on exploration, see the end of the introductory vignette, or the slower-paced vignette on exploration

wordvectors's Issues

Allow variables in formulas

For testing quantities, it would be nice to allow variables.

Currently, this code fails with "Error in tree[[1]] : object of type 'symbol' is not subsettable". Looks like a parse error, not a namespace one, so should be an easy fix.

dist = 1.2
form = ~ "king" + dist * ("woman" - "man")
glove %>% closest_to(form)
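One way such a formula could be evaluated is to walk its expression tree, swap each string literal for the matching row of the model, and leave symbols such as dist to be looked up in the formula's environment. The sketch below does this against a plain matrix; `eval_word_formula` is a hypothetical name, and this is not how the package currently parses formulas.

```r
# Evaluate a one-sided formula against a matrix of word vectors.
# Strings become row vectors; numbers, variables and operators are
# evaluated normally in the environment where the formula was created.
eval_word_formula <- function(form, m) {
  substitute_words <- function(node) {
    if (is.character(node)) {
      return(m[node, ])                                  # "king" -> its vector
    }
    if (is.call(node)) {
      return(as.call(lapply(as.list(node), substitute_words)))
    }
    node                                                 # symbols, numbers
  }
  expr <- substitute_words(form[[2]])   # form[[2]] is the right-hand side
  eval(expr, envir = environment(form))
}

toy <- matrix(c(1, 1,
                1, 0,
                0, 1),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("king", "man", "woman"), NULL))

dist <- 2
shifted <- eval_word_formula(~ "king" + dist * ("woman" - "man"), toy)
```

With these toy values `shifted` is c(-1, 3), that is, the "king" row plus twice the difference between the "woman" and "man" rows.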

Chinese is not available

The format of my corpus is:

I use the following procedure for training:
model = train_word2vec("cookbooks.txt","cookbook_vectors.bin",vectors=200,threads=4,window=12,iter=5,negative_samples=0)

result is:

error in scan

Hi Ben,

I accidentally reinstalled wordVectors, and now, on a file that worked fine before, I get this kind of error:

  scan() expected 'a real', got 'm��9&x�9�����&e����8x0S9���8�_l7m�{�*��9DD��ܲ�86�R��҅�c�g�ƒ��0O49�|S9P�O9�V�8���8y鄹G(�����9/݄9��X9��(�z[�9!i'

In addition: Warning messages:

1: In utils::read.table(filename, header = F, skip = 1, colClasses = c("character",  :
  line 4 appears to contain embedded nulls

2: In utils::read.table(filename, header = F, skip = 1, colClasses = c("character",  :
  line 5 appears to contain embedded nulls

...so I was thinking that maybe it was an encoding issue, but file is in UTF-8, all seems tickety-boo. Wondered what you thought.

Delimiter of text context

Hi,
thanks for the library. I would like to ask whether it's possible to define a delimiter for the sliding window. For example, I want to put each sentence on a new line and, as the context for each word, use only the words on the same line.

Example:
I have a dream.
Cat eats a hotdog.

So I would like the context of "dream" to be just {I, have, a}, not Cat...

Is it possible to do that?

Error in train_word2vec: "Error in if (binary) { : argument is of length zero"

I'm trying to train a model, and ran this in R using a plain text file* that I'd prepared using prep_word2vec:
girl.model <- train_word2vec("girls.txt", output_file = "~/MOC_Project/Data/girl_vectors", vectors = 300, threads = 2, window = 12, classes = 0, min_count = 5, iter = 5, force = TRUE, negative_samples = 5)
I got these messages, which seemed OK:

Starting training using file /Users/ella/Desktop/Capstone/CSStuff/MOC_Project/Scripts/girls.txt
Vocab size: 18998
Words in train file: 1006509

And then, after about a minute, I got this error:

Error in if (binary) { : argument is of length zero

I have no idea what caused this, but it seems like I can't train the model.

*Of a bunch of old social media profiles, which probably isn't relevant

plot function not working

After training a model, I get:

Vocab size (unigrams + bigrams): 4655
Words in train file: 14208
Starting training using file temp.prep
Vocab size: 343
Words in train file: 4333

But when I try to use plot, I get this error:

> plot(model)
Error in xy.coords(x, y, xlabel, ylabel, log) :
  'x' and 'y' lengths differ
In addition: Warning message:
In matrix(nextup, ncol = j) :
  data length [343] is not a sub-multiple or multiple of the number of rows [172]

Any ideas would be appreciated. Thanks!

Does it accept Arabic (or any non-ASCII) in general?

I'm having difficulty vectorizing an Arabic text; I don't seem to be able to get anything useful.

The word2vec function is only extracting funny characters (like emojis and so on) from a text file of about 200k Arabic words. It also seems to convert these characters to codepoint values.

I would like to have a nice and normal-looking word2vec model for my Arabic text.

Any comments or workarounds?

This package is not available for 3.5.

As the title says, this package cannot be installed in R 3.5 or above.

Faster version of function prep_word2vec

As you note in the README, prep_word2vec is slow. But it seems to be unnecessarily slow because it uses some base R functions and has to do some splitting of long lines. If you are willing to take a dependency on the tokenizers package (and thus stringi, which you already check for) then the function can probably go quite a bit faster. Here is a sketch of a function that takes roughly a tenth of the time on the cookbook corpus. (On my machine, 52.7 seconds for prep_word2vec and 5.8 seconds for prep_word2vec_alt.) This function does not include the bundle_ngrams option. I can write this up properly, including bundle_ngrams, and send it as a PR if you are interested.

The resulting files can't be tested for identical results, since prep_word2vec introduces NA in places for reasons I don't understand.

# Presumably these will be available as imports
# require(readr)
# require(stringr)
# require(tokenizers)
library(magrittr)
library(wordVectors)

prep_word2vec_alt <- function(origin, destination, lowercase = TRUE) {
  files <- list.files(origin, recursive = TRUE, full.names = TRUE)
  Map(prep_single_file, files, destination, lowercase)
  invisible(destination)
}

prep_single_file <- function(file_in, file_out, lowercase) {
  message("Prepping ", file_in)

  text <- file_in %>%
    readr::read_file() %>%
    tokenizers::tokenize_words(simplify = TRUE, lowercase = lowercase) %>%
    stringr::str_c(collapse = " ")

  stopifnot(length(text) == 1)
  readr::write_lines(text, file_out, append = TRUE)
  return(TRUE)
}

original_time <- system.time({
  cookbooks <- prep_word2vec("cookbooks", "cookbook.txt", lowercase = TRUE)
  })
alternate_time <- system.time({
  cookbooks_alt <- prep_word2vec_alt("cookbooks", "cookbook-alt.txt",
                                                lowercase = TRUE)
  })

original_time
alternate_time

> original_time
   user  system elapsed 
 25.402  27.369  52.729 
> alternate_time
   user  system elapsed 
  4.903   0.364   5.809

How to get plot points

Hi!

I am able to see the plot on my screen using R Studio, but I'm unable to get the x- and y- coordinates of the words I plotted.

plot(model, perplexity=50)

Also, how do I write those plot points (x- and y- coordinates, not the vectors) to a csv or txt file?

Training progress report and speed

Hi,

I installed the latest version recently.
There are two differences compared to previous version.

There is no progress indicator.
Training seems slower.

Can I get the progress indicator back in the latest version?
And is there any reason for the apparent slowdown?

Quick start - error in type.convert

Problem:
Running through the Quick Start instructions in README.md, the process dies with an error in type.convert.

> model = train_word2vec("cookbooks.txt",output="cookbooks.vectors",threads = 3,vectors = 100,window=12)
Starting training using file /home/brandon/repo/stack/data/cookbooks.txt
Vocab size: 32421
Words in train file: 10577282
Alpha: 0.000195  Progress: 99.24%  Words/thread/sec: 18.39k  
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, numerals = numerals,  : 
  invalid multibyte string at '<f6>(<83>;<a4><d0>�;��{<bb>{<d4>V<bb><b8>�<b3>:q<fd>E;ףv:<9a><99>]9<f6>(l<bb><d7>c�;'

I tried to retrain model on a small subset of cookbooks and that failed similarly.

> model = train_word2vec("cookbooks.txt",output="cookbooks.vectors",threads = 3,vectors = 100,window=12, force=T)
Starting training using file /home/brandon/repo/stack/data/cookbooks.txt
Vocab size: 5331
Words in train file: 345615
Alpha: 0.000073  Progress: 100.34%  Words/thread/sec: 19.73k  
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, numerals = numerals,  : 
  invalid multibyte string at '<f6>(<83>;<a4><d0>�;��{<bb>{<d4>V<bb><b8>�<b3>:q<fd>E;ףv:<9a><99>]9<f6>(l<bb><d7>c�;'

It appears to be choking on an unusual character or unexpected byte. Was there a change in the way the cookbook data was initially saved versus how it is currently processed? The following warnings are also shown:

In addition: Warning messages:
1: In utils::read.table(filename, header = F, skip = 1, colClasses = c("character",  :
  line 1 appears to contain embedded nulls
2: In utils::read.table(filename, header = F, skip = 1, colClasses = c("character",  :
  line 2 appears to contain embedded nulls
3: In utils::read.table(filename, header = F, skip = 1, colClasses = c("character",  :
  line 5 appears to contain embedded nulls
4: In utils::read.table(filename, header = F, skip = 1, nrows = 1,  :
  line 1 appears to contain embedded nulls

Additional system details:
OS - Ubuntu 14.04
R - [1] "R version 3.2.3 (2015-12-10)"

wrap glove training

I've been using this package a bit to explore the standard GloVe model distributed by Stanford.

It might be useful to let this package train GloVe models as well using text2vec.

I suspect this isn't why people install it, but I see the major advantage of this package being the syntax and function wrapping; text2vec's creator wants to keep a minimal feature set, so I think there's relatively little overlap between the two packages.

Installation of Package under Windows 7 64-bit raises errors

Hi guys,

I am trying to install the package under Windows 7, 64-bit. I use R 3.2.4 and RTools 3.3. Unfortunately I get an error when trying to install the package (see the error trace below). Are there any ideas on how to fix it?

Thank you.

P.S. Below is the error trace:

> devtools::install_github("bmschmidt/wordVectors")
Downloading GitHub repo bmschmidt/wordVectors@master
from URL https://api.github.com/repos/bmschmidt/wordVectors/zipball/master
Installing wordVectors
"C:/R/R-32~1.4RE/bin/i386/R" --no-site-file --no-environ --no-save --no-restore --quiet CMD INSTALL  \
  "C:/Users/sqladmin/AppData/Local/Temp/Rtmp6bqZZH/devtools21f05e8c6da7/bmschmidt-wordVectors-7f1914c"  \
  --library="C:/R/R-3.2.4revised/library" --install-tests 

* installing *source* package 'wordVectors' ...
** libs

*** arch - i386
gcc -m32 -I"C:/R/R-32~1.4RE/include" -DNDEBUG     -I"d:/RCompile/r-compiling/local/local323/include"  -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -w   -O3 -Wall  -std=gnu99 -mtune=core2 -c tmcn_word2vec.c -o tmcn_word2vec.o
gcc -m32 -I"C:/R/R-32~1.4RE/include" -DNDEBUG     -I"d:/RCompile/r-compiling/local/local323/include"  -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -w   -O3 -Wall  -std=gnu99 -mtune=core2 -c word2phrase.c -o word2phrase.o
gcc -m32 -shared -s -static-libgcc -o wordVectors.dll tmp.def tmcn_word2vec.o word2phrase.o -pthread -Ld:/RCompile/r-compiling/local/local323/lib/i386 -Ld:/RCompile/r-compiling/local/local323/lib -LC:/R/R-32~1.4RE/bin/i386 -lR
installing to C:/R/R-3.2.4revised/library/wordVectors/libs/i386

*** arch - x64
gcc -m64 -I"C:/R/R-32~1.4RE/include" -DNDEBUG     -I"d:/RCompile/r-compiling/local/local323/include"  -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -w   -O2 -Wall  -std=gnu99 -mtune=core2 -c tmcn_word2vec.c -o tmcn_word2vec.o
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s: Assembler messages:
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1135: Error: no such instruction: `vfmadd312ss (%rbx),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1154: Error: no such instruction: `vfmadd312ss 4(%rbx),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1158: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1163: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1168: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1173: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1178: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1183: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1193: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1197: Error: no such instruction: `vfmadd312ss (%rbx,%r13,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1201: Error: no such instruction: `vfmadd312ss (%rbx,%r11,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1205: Error: no such instruction: `vfmadd312ss (%rbx,%r14,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1209: Error: no such instruction: `vfmadd312ss (%rbx,%r13,4),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1213: Error: no such instruction: `vfmadd312ss (%rbx,%r11,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1217: Error: no such instruction: `vfmadd312ss (%rbx,%r14,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1220: Error: no such instruction: `vfmadd312ss (%rbx,%r13,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1229: Error: no such instruction: `vfmadd312ss (%rax),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1248: Error: no such instruction: `vfmadd312ss 4(%rax),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1252: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1257: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1262: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1267: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1272: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1277: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1287: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1291: Error: no such instruction: `vfmadd312ss (%rax,%r14,4),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1295: Error: no such instruction: `vfmadd312ss (%rax,%r13,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1299: Error: no such instruction: `vfmadd312ss (%rax,%r11,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1303: Error: no such instruction: `vfmadd312ss (%rax,%r14,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1307: Error: no such instruction: `vfmadd312ss (%rax,%r13,4),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1311: Error: no such instruction: `vfmadd312ss (%rax,%r11,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1314: Error: no such instruction: `vfmadd312ss (%rax,%r14,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1878: Error: no such instruction: `vfmadd312ss (%rbx),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1897: Error: no such instruction: `vfmadd312ss 4(%rbx),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1901: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1906: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1911: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1916: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1921: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1926: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1936: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1940: Error: no such instruction: `vfmadd312ss (%rbx,%r14,4),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1944: Error: no such instruction: `vfmadd312ss (%rbx,%r10,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1948: Error: no such instruction: `vfmadd312ss (%rbx,%r15,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1952: Error: no such instruction: `vfmadd312ss (%rbx,%r14,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1956: Error: no such instruction: `vfmadd312ss (%rbx,%r10,4),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1960: Error: no such instruction: `vfmadd312ss (%rbx,%r15,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1963: Error: no such instruction: `vfmadd312ss (%rbx,%r14,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1972: Error: no such instruction: `vfmadd312ss (%rax),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1991: Error: no such instruction: `vfmadd312ss 4(%rax),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1995: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2000: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2005: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2010: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2015: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2020: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2029: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2033: Error: no such instruction: `vfmadd312ss (%rax,%r15,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2037: Error: no such instruction: `vfmadd312ss (%rax,%r14,4),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2041: Error: no such instruction: `vfmadd312ss (%rax,%r10,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2045: Error: no such instruction: `vfmadd312ss (%rax,%r15,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2051: Error: no such instruction: `vfmadd312ss (%rax,%r14,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2054: Error: no such instruction: `vfmadd312ss (%rax,%r10,4),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2057: Error: no such instruction: `vfmadd312ss (%rax,%r15,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2238: Error: no such instruction: `vfmadd312ss (%rbx),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2257: Error: no such instruction: `vfmadd312ss 4(%rbx),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2261: Error: no such instruction: `vfmadd312ss (%rbx,%r8,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2266: Error: no such instruction: `vfmadd312ss (%rbx,%r8,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2271: Error: no such instruction: `vfmadd312ss (%rbx,%r8,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2276: Error: no such instruction: `vfmadd312ss (%rbx,%r8,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2281: Error: no such instruction: `vfmadd312ss (%rbx,%r8,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2286: Error: no such instruction: `vfmadd312ss (%rbx,%r8,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2296: Error: no such instruction: `vfmadd312ss (%rbx,%r8,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2299: Error: no such instruction: `vfmadd312ss (%rbx,%rbp,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2303: Error: no such instruction: `vfmadd312ss (%rbx,%r9,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2306: Error: no such instruction: `vfmadd312ss (%rbx,%rbp,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2311: Error: no such instruction: `vfmadd312ss (%rbx,%r9,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2314: Error: no such instruction: `vfmadd312ss (%rbx,%rbp,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2320: Error: no such instruction: `vfmadd312ss (%rbx,%r9,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2323: Error: no such instruction: `vfmadd312ss (%rbx,%rbp,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2332: Error: no such instruction: `vfmadd312ss (%rax),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2351: Error: no such instruction: `vfmadd312ss 4(%rax),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2355: Error: no such instruction: `vfmadd312ss (%rax,%r8,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2360: Error: no such instruction: `vfmadd312ss (%rax,%r8,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2365: Error: no such instruction: `vfmadd312ss (%rax,%r8,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2370: Error: no such instruction: `vfmadd312ss (%rax,%r8,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2375: Error: no such instruction: `vfmadd312ss (%rax,%r8,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2380: Error: no such instruction: `vfmadd312ss (%rax,%r8,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2389: Error: no such instruction: `vfmadd312ss (%rax,%r8,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2392: Error: no such instruction: `vfmadd312ss (%rax,%r9,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2396: Error: no such instruction: `vfmadd312ss (%rax,%rbp,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2400: Error: no such instruction: `vfmadd312ss (%rax,%r9,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2404: Error: no such instruction: `vfmadd312ss (%rax,%rbp,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2408: Error: no such instruction: `vfmadd312ss (%rax,%r9,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2414: Error: no such instruction: `vfmadd312ss (%rax,%rbp,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2417: Error: no such instruction: `vfmadd312ss (%rax,%r9,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2475: Error: no such instruction: `vfmadd312ss (%rbx),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2494: Error: no such instruction: `vfmadd312ss 4(%rbx),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2498: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2503: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2508: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm5'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2513: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm4'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2518: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2523: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2533: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm5'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2536: Error: no such instruction: `vfmadd312ss (%rbx,%r11,4),%xmm0,%xmm4'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2540: Error: no such instruction: `vfmadd312ss (%rbx,%r8,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2543: Error: no such instruction: `vfmadd312ss (%rbx,%r11,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2548: Error: no such instruction: `vfmadd312ss (%rbx,%r8,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2551: Error: no such instruction: `vfmadd312ss (%rbx,%r11,4),%xmm0,%xmm5'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2557: Error: no such instruction: `vfmadd312ss (%rbx,%r8,4),%xmm0,%xmm4'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2560: Error: no such instruction: `vfmadd312ss (%rbx,%r11,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2569: Error: no such instruction: `vfmadd312ss (%rax),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2588: Error: no such instruction: `vfmadd312ss 4(%rax),%xmm0,%xmm5'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2592: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm4'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2597: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2602: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2607: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2612: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm5'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2617: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm4'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2627: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm4'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2630: Error: no such instruction: `vfmadd312ss (%rax,%r8,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2634: Error: no such instruction: `vfmadd312ss (%rax,%r11,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2637: Error: no such instruction: `vfmadd312ss (%rax,%r8,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2642: Error: no such instruction: `vfmadd312ss (%rax,%r11,4),%xmm0,%xmm5'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2645: Error: no such instruction: `vfmadd312ss (%rax,%r8,4),%xmm0,%xmm4'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2651: Error: no such instruction: `vfmadd312ss (%rax,%r11,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2654: Error: no such instruction: `vfmadd312ss (%rax,%r8,4),%xmm0,%xmm7'
make: *** [tmcn_word2vec.o] Error 1
Warning: running command 'make -f "Makevars.win" -f "C:/R/R-32~1.4RE/etc/x64/Makeconf" -f "C:/R/R-32~1.4RE/share/make/winshlib.mk" SHLIB="wordVectors.dll" WIN=64 TCLBIN=64 OBJECTS="tmcn_word2vec.o word2phrase.o"' had status 2
ERROR: compilation failed for package 'wordVectors'
* removing 'C:/R/R-3.2.4revised/library/wordVectors'
Error: Command failed (1)`

doc2vec

Thanks for developing this package. Do you have any plans to implement a doc2vec model in the future?

input text file

I am using the train_word2vec function to train vectors, and I am wondering what the best input text format is.
Should I put each sentence on its own line? Would that interfere with training?

How does the algorithm handle spaces and line breaks?

Thank you.

Vocabulary size is much smaller than it ought to be

"Vocab size" is way off. See the attached screen shot: 1.4 billion words of PubMedCentral author manuscripts, and the vocabulary size is 1,307, according to the status message output. Doesn't seem likely.

I'm using whatever version of wordVectors was on GitHub as of mid-January 2015. Not sure what version of RStudio--I think those 1.4 billion words of text have choked my laptop to death... OS X.

float?

Hi - it's mentioned that single precision support isn't available in R. Perhaps that has changed?

When I run install.packages('float'), a package that supports single precision is installed.

R Session Aborted

When I run train_word2vec() R crashes immediately. The file to be imported is 200 novels run through prep_word2vec() which results in a 120 MB .txt file. I've tried on Mac 10.9 and 10.10 as well as R 3.1 and 3.2. Same result in all cases. I'm guessing you've run this on much larger data. Any ideas?

include vignette?

Hello,

I figured out how to run install_github in such a way as to compile the vignette, but it's taking a long time. Can we include a PDF vignette in the repo and link to it from README.md? It could even be included in a separate branch if you don't want it to pollute the revision history.

I was surprised not to be able to find a PDF vignette of this project on Google. It would be very useful. I'm trying to figure out how to make a 2-d scatter plot from a given subset of the word vectors. It looks like that is being done in the vignette, but it would be useful to be able to see the example plot output first.

Thanks!
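
One way to get such a scatter plot with base R alone is to project the chosen rows onto their first two principal components. This is only a sketch, not the vignette's method: the random matrix stands in for a trained model, and the names `vectors`, `subset_words`, and `coords` are made up for illustration.

```r
# Stand-in for a trained model: a small random matrix with words as row names.
set.seed(1)
words <- c("king", "queen", "man", "woman", "apple", "pear")
vectors <- matrix(rnorm(length(words) * 10), nrow = length(words),
                  dimnames = list(words, NULL))

# Pick the subset of words to plot and project it onto two principal components.
subset_words <- c("king", "queen", "man", "woman")
pca <- prcomp(vectors[subset_words, ], center = TRUE, scale. = FALSE)
coords <- pca$x[, 1:2]

# Draw an empty plot and label each point with its word.
plot(coords, type = "n", xlab = "PC1", ylab = "PC2")
text(coords, labels = rownames(coords))
```

With a real model, the same steps should apply to any numeric matrix whose row names are the words.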

Add function to align different models

This Stanford paper describes the most promising method I've seen so far for aligning multiple different models; it would be a useful addition here.

In order to compare word vectors from different time-periods we must ensure that the vectors are aligned to the same coordinate axes. Explicit PPMI vectors are naturally aligned, as each column simply corresponds to a context word. Low-dimensional embeddings will not be naturally aligned due to the non-unique nature of the SVD and the stochastic nature of SGNS. In particular, both these methods may result in arbitrary orthogonal transformations, which do not affect pairwise cosine-similarities within-years but will preclude comparison of the same word across time. Previous work circumvented this problem by either avoiding low-dimensional embeddings (e.g., Gulordava and Baroni, 2011; Jatowt and Duh, 2014) or by performing heuristic local alignments per word (Kulkarni et al., 2014).
We use orthogonal Procrustes to align the learned low-dimensional embeddings. Defining $W^{(t)} \in \mathbb{R}^{d \times |V|}$ as the matrix of word embeddings learned at year t, we align across time-periods while preserving cosine similarities by optimizing:
$$R^{(t)} = \arg\min_{Q^\top Q = I} \left\| W^{(t)} Q - W^{(t+1)} \right\|_F , \quad (4)$$
with $R^{(t)} \in \mathbb{R}^{d \times d}$. The solution corresponds to the best rotational alignment and can be obtained efficiently using an application of SVD (Schönemann, 1966).
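
The SVD solution mentioned in the quote can be sketched in a few lines of base R. The function name `procrustes_align` is hypothetical and not part of this package. The sketch also assumes that both matrices store words as rows (one row per word, in the same order, as a wordVectors model does), so that the product `A %*% Q` is conformable.

```r
# Orthogonal Procrustes: find the orthogonal Q minimising ||A Q - B||_F.
# A and B are numeric matrices of identical shape, one row per word,
# with rows in the same word order (an assumption of this sketch).
procrustes_align <- function(A, B) {
  s <- svd(crossprod(A, B))      # t(A) %*% B = U D t(V)
  Q <- s$u %*% t(s$v)            # optimal rotation Q = U t(V)
  list(rotation = Q, aligned = A %*% Q)
}

# Toy check: B is an exactly rotated copy of A, so alignment should recover it.
set.seed(42)
A <- matrix(rnorm(40), nrow = 10)
Q0 <- qr.Q(qr(matrix(rnorm(16), nrow = 4)))   # a random orthogonal matrix
B <- A %*% Q0
res <- procrustes_align(A, B)
```

For two real models, the rows would first have to be restricted to the shared vocabulary and put in the same order.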

Restore printing of status updates

In switching from printf to the CRAN-approved Rprintf, I've hit a problem. The C code wants to print out status updates to the console, here. It used to work fine; but when these lines are uncommented, R crashes with an error that C stack usage exceeds 261600796060 or something of the sort. The point at which the crash comes is proportional to how often the loop is run (if I change the counter here to once every 1000 lines, it crashes ten times sooner, and ten times later if it's once every 100,000 lines). So apparently calling Rprintf creates a memory leak. I can't find any solutions to this online.

For now I've turned off printing, but this is a process that can take hours, and visual feedback is extremely useful.

Windows 8 (but not Windows 7 or 10?) compilation fails

Hi there,

I am having trouble installing this on Windows 8 64-bit. I do have Rtools installed as well.

> devtools::install_github("bmschmidt/wordVectors")
Downloading GitHub repo bmschmidt/wordVectors@master
Installing wordVectors
"C:/PROGRA~1/R/R-32~1.1/bin/x64/R" --no-site-file --no-environ --no-save  \
  --no-restore CMD INSTALL  \
  "C:/Users/MotoBot/AppData/Local/Temp/Rtmpq836VL/devtools21f45c5a6f54/bmschmidt-wordVectors-cfd14a5"  \
  --library="C:/Users/MotoBot/Documents/R/win-library/3.2" --install-tests 

* installing *source* package 'wordVectors' ...
** libs

*** arch - i386
gcc -m32 -I"C:/PROGRA~1/R/R-32~1.1/include" -DNDEBUG     -I"d:/RCompile/r-compiling/local/local320/include"  -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result    -O3 -Wall  -std=gnu99 -mtune=core2 -c tmcn_distance.c -o tmcn_distance.o
In file included from tmcn_distance.c:2:0:
distance.h: In function 'distance':
distance.h:40:3: warning: unknown conversion type character 'l' in format [-Wformat]
distance.h:40:3: warning: too many arguments for format [-Wformat-extra-args]
distance.h:41:3: warning: unknown conversion type character 'l' in format [-Wformat]
distance.h:41:3: warning: too many arguments for format [-Wformat-extra-args]
distance.h:42:3: warning: implicit declaration of function 'malloc' [-Wimplicit-function-declaration]
distance.h:42:19: warning: incompatible implicit declaration of built-in function 'malloc' [enabled by default]
distance.h:45:5: warning: unknown conversion type character 'l' in format [-Wformat]
distance.h:45:5: warning: unknown conversion type character 'l' in format [-Wformat]
distance.h:45:5: warning: unknown conversion type character 'l' in format [-Wformat]
distance.h:45:5: warning: too many arguments for format [-Wformat-extra-args]
distance.h:84:5: warning: unknown conversion type character 'l' in format [-Wformat]
distance.h:84:5: warning: too many arguments for format [-Wformat-extra-args]
gcc -m32 -I"C:/PROGRA~1/R/R-32~1.1/include" -DNDEBUG     -I"d:/RCompile/r-compiling/local/local320/include"  -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result    -O3 -Wall  -std=gnu99 -mtune=core2 -c tmcn_word2vec.c -o tmcn_word2vec.o
In file included from tmcn_word2vec.c:3:0:
word2vec.h: In function 'LearnVocabFromTrainFile':
word2vec.h:280:7: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:280:7: warning: format '%c' expects argument of type 'int', but argument 2 has type 'long long int' [-Wformat]
word2vec.h:280:7: warning: too many arguments for format [-Wformat-extra-args]
word2vec.h:292:5: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:292:5: warning: too many arguments for format [-Wformat-extra-args]
word2vec.h:293:5: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:293:5: warning: too many arguments for format [-Wformat-extra-args]
word2vec.h: In function 'SaveVocab':
word2vec.h:302:3: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:302:3: warning: too many arguments for format [-Wformat-extra-args]
word2vec.h: In function 'ReadVocab':
word2vec.h:321:5: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:321:5: warning: format '%c' expects argument of type 'char *', but argument 3 has type 'long long int *' [-Wformat]
word2vec.h:321:5: warning: too many arguments for format [-Wformat-extra-args]
word2vec.h:326:5: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:326:5: warning: too many arguments for format [-Wformat-extra-args]
word2vec.h:327:5: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:327:5: warning: too many arguments for format [-Wformat-extra-args]
word2vec.h: In function 'TrainModelThread':
word2vec.h:367:36: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
word2vec.h:373:50: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
word2vec.h: In function 'TrainModel':
word2vec.h:549:5: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:549:5: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:549:5: warning: too many arguments for format [-Wformat-extra-args]
tmcn_word2vec.c: In function 'tmcn_word2vec':
tmcn_word2vec.c:12:9: warning: assignment makes pointer from integer without a cast [enabled by default]
tmcn_word2vec.c: In function 'TrainModelThread':
word2vec.h:530:1: warning: control reaches end of non-void function [-Wreturn-type]
gcc -m32 -shared -s -static-libgcc -o wordVectors.dll tmp.def tmcn_distance.o tmcn_word2vec.o -pthread -Ld:/RCompile/r-compiling/local/local320/lib/i386 -Ld:/RCompile/r-compiling/local/local320/lib -LC:/PROGRA~1/R/R-32~1.1/bin/i386 -lR
installing to C:/Users/MotoBot/Documents/R/win-library/3.2/wordVectors/libs/i386

*** arch - x64
gcc -m64 -I"C:/PROGRA~1/R/R-32~1.1/include" -DNDEBUG     -I"d:/RCompile/r-compiling/local/local320/include"  -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result    -O2 -Wall  -std=gnu99 -mtune=core2 -c tmcn_distance.c -o tmcn_distance.o
In file included from tmcn_distance.c:2:0:
distance.h: In function 'distance':
distance.h:40:3: warning: unknown conversion type character 'l' in format [-Wformat]
distance.h:40:3: warning: too many arguments for format [-Wformat-extra-args]
distance.h:41:3: warning: unknown conversion type character 'l' in format [-Wformat]
distance.h:41:3: warning: too many arguments for format [-Wformat-extra-args]
distance.h:42:3: warning: implicit declaration of function 'malloc' [-Wimplicit-function-declaration]
distance.h:42:19: warning: incompatible implicit declaration of built-in function 'malloc' [enabled by default]
distance.h:45:5: warning: unknown conversion type character 'l' in format [-Wformat]
distance.h:45:5: warning: unknown conversion type character 'l' in format [-Wformat]
distance.h:45:5: warning: unknown conversion type character 'l' in format [-Wformat]
distance.h:45:5: warning: too many arguments for format [-Wformat-extra-args]
distance.h:84:5: warning: unknown conversion type character 'l' in format [-Wformat]
distance.h:84:5: warning: too many arguments for format [-Wformat-extra-args]
gcc -m64 -I"C:/PROGRA~1/R/R-32~1.1/include" -DNDEBUG     -I"d:/RCompile/r-compiling/local/local320/include"  -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result    -O2 -Wall  -std=gnu99 -mtune=core2 -c tmcn_word2vec.c -o tmcn_word2vec.o
In file included from tmcn_word2vec.c:3:0:
word2vec.h: In function 'LearnVocabFromTrainFile':
word2vec.h:280:7: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:280:7: warning: format '%c' expects argument of type 'int', but argument 2 has type 'long long int' [-Wformat]
word2vec.h:280:7: warning: too many arguments for format [-Wformat-extra-args]
word2vec.h:292:5: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:292:5: warning: too many arguments for format [-Wformat-extra-args]
word2vec.h:293:5: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:293:5: warning: too many arguments for format [-Wformat-extra-args]
word2vec.h: In function 'SaveVocab':
word2vec.h:302:3: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:302:3: warning: too many arguments for format [-Wformat-extra-args]
word2vec.h: In function 'ReadVocab':
word2vec.h:321:5: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:321:5: warning: format '%c' expects argument of type 'char *', but argument 3 has type 'long long int *' [-Wformat]
word2vec.h:321:5: warning: too many arguments for format [-Wformat-extra-args]
word2vec.h:326:5: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:326:5: warning: too many arguments for format [-Wformat-extra-args]
word2vec.h:327:5: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:327:5: warning: too many arguments for format [-Wformat-extra-args]
word2vec.h: In function 'TrainModel':
word2vec.h:544:84: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
word2vec.h:549:5: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:549:5: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:549:5: warning: too many arguments for format [-Wformat-extra-args]
tmcn_word2vec.c: In function 'tmcn_word2vec':
tmcn_word2vec.c:12:9: warning: assignment makes pointer from integer without a cast [enabled by default]
tmcn_word2vec.c: In function 'TrainModelThread':
word2vec.h:530:1: warning: control reaches end of non-void function [-Wreturn-type]
C:\Users\MotoBot\AppData\Local\Temp\cc88Kdj9.s: Assembler messages:
C:\Users\MotoBot\AppData\Local\Temp\cc88Kdj9.s:1094: Error: no such instruction: `vfmadd312ss (%rbx),%xmm0,%xmm8'

[... edited out by BMS--dozens of other assembler errors]

make: *** [tmcn_word2vec.o] Error 1
Warning: running command 'make -f "Makevars.win" -f "C:/PROGRA~1/R/R-32~1.1/etc/x64/Makeconf" -f "C:/PROGRA~1/R/R-32~1.1/share/make/winshlib.mk" SHLIB="wordVectors.dll" WIN=64 TCLBIN=64 OBJECTS="tmcn_distance.o tmcn_word2vec.o"' had status 2
ERROR: compilation failed for package 'wordVectors'
* removing 'C:/Users/MotoBot/Documents/R/win-library/3.2/wordVectors'

My environment:

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_South Africa.1252  LC_CTYPE=English_South Africa.1252   
[3] LC_MONETARY=English_South Africa.1252 LC_NUMERIC=C                         
[5] LC_TIME=English_South Africa.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] httr_1.0.0     R6_2.1.1       magrittr_1.5   tools_3.2.1    curl_0.9.3    
 [6] memoise_0.2.1  stringi_1.0-1  knitr_1.11     stringr_1.0.0  digest_0.6.8  
[11] devtools_1.9.1

Any idea how I can fix this?

Thanks

rword2vec, R session aborted

Hello,

I just installed rword2vec, but the R session aborts when I try to use it. My code is below. This question has been asked before, but the issue remains unsolved. Could you please help?

library(rword2vec)
model <- word2vec(
train_file = "text8",
output_file = "vec.bin",
binary=1,
num_threads=3,
debug_mode=1)

Unable to compile with gcc 10 | Arch Linux solution = downgrade

When installing the package on Arch Linux, it fails to compile with the following error:

ccache gcc -shared -L/usr/lib64/R/lib -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -o wordVectors.so tmcn_word2vec.o word2phrase.o -L/usr/lib64/R/lib -lR
/usr/bin/ld: word2phrase.o:(.bss+0x18): multiple definition of `vocab'; tmcn_word2vec.o:(.bss+0x78): first defined here
collect2: error: ld returned 1 exit status
make: *** [/usr/share/R//make/shlib.mk:6: wordVectors.so] Error 1
ERROR: compilation failed for package ‘wordVectors’

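This seems to be an error raised by gcc 10, which defaults to -fno-common and therefore rejects a global such as `vocab` being defined in more than one object file. It should be fixable quite easily.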

Temporary Fix (see here)

Downgrade gcc and related packages (in my case gcc-libs and gcc-fortran) simultaneously to version 9.3.0-1

Either downgrade if you still have the old cached version or download them from
https://archive.archlinux.org/packages/g/gcc/gcc-9.3.0-1-x86_64.pkg.tar.zst
https://archive.archlinux.org/packages/g/gcc-libs/gcc-libs-9.3.0-1-x86_64.pkg.tar.zst and
https://archive.archlinux.org/packages/g/gcc-fortran/gcc-fortran-9.3.0-1-x86_64.pkg.tar.zst

Install using sudo pacman -U gcc-9.3.0-1-x86_64.pkg.tar.zst gcc-libs-9.3.0-1-x86_64.pkg.tar.zst gcc-fortran-9.3.0-1-x86_64.pkg.tar.zst. It's important to downgrade them at the same time because one depends on the other.

To prevent automated updates of gcc until the issue is resolved, add the three packages to IgnorePkg in /etc/pacman.conf.

Error when using reject()

I'm experimenting with the kind of vector rejection described here: http://bookworm.benschmidt.org/posts/2015-10-30-rejecting-the-gender-binary.html

After creating my model:

ff_vectors = train_word2vec("data/processed_tweets.txt")

I try:

beast = ff_vectors[["beast"]] %>% reject(ff_vectors[["points"]])

and get:

Error in crossprod(t(matrix %*% b)/as.vector((b %*% b)), b) : 
  non-conformable arguments

Any help would be appreciated. I'm very interested in working more with the package.
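
For reference, the arithmetic behind rejecting one vector from another can be written in base R. This is a sketch of the operation itself, not of the package's `reject()` implementation; `reject_vec` is a hypothetical name. Coercing both inputs to plain numeric vectors first sidesteps the row-versus-column shape mismatches that produce "non-conformable arguments" errors in matrix products.

```r
# Vector rejection: remove from `a` its component parallel to `b`,
# leaving the part of `a` that is orthogonal to `b`.
reject_vec <- function(a, b) {
  a <- as.numeric(a)   # drop any matrix shape so the arithmetic is elementwise
  b <- as.numeric(b)
  a - (sum(a * b) / sum(b * b)) * b
}

a <- c(3, 4, 0)
b <- c(1, 0, 0)
r <- reject_vec(a, b)   # the component of `a` along `b` is removed
```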

is there a way to get the token frequencies

Great tool!

I couldn't figure out how words are sorted by frequency if the frequencies are not part of the .bin file or the VectorSpaceModel. I guess the frequencies are tracked in the code which does the training, but left out of the trained vector file? Maybe I'll use 1/rank (Zipf's law) to approximate the frequency, but it would be good to have this documented somewhere. Thanks!
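
The 1/rank approximation mentioned above could look like the following sketch. It assumes the vocabulary is ordered from most to least frequent, as described in this issue; `zipf_frequencies` is a made-up helper, and the result is only a rough stand-in for counts that the model file does not store.

```r
# Approximate relative frequencies from rank alone (Zipf's law with exponent 1).
# `vocab` must be ordered from most to least frequent.
zipf_frequencies <- function(vocab) {
  weights <- 1 / seq_along(vocab)
  setNames(weights / sum(weights), vocab)   # normalise so the values sum to 1
}

vocab <- c("the", "of", "and", "to")
approx_freq <- zipf_frequencies(vocab)
```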

How to convert a data frame or table to a VectorSpaceModel?

I have a word vector data frame created outside of wordVectors that currently looks like
V1 V2 V3 V4 V5 .............
1 der -0.1292338 1.41541564 0.72683984 -0.08601953
2 die -0.7408874 1.23070979 1.60728443 0.21427894
3 und 0.1368700 0.21688898 0.09194378 -0.42764056
4 in -0.9566143 1.17804027 0.13917272 1.63949668
5 von -1.2693109 0.92857528 -0.88062751 1.41522074
6 den -0.8766794 0.45545051 1.42592216 -1.87232220
7 des 0.8585002 0.80657679 2.12942553 -1.49346220
8 im -1.8885295 0.35904437 0.97661573 -0.38748211
9 mit -0.5756816 -1.57236266 -2.10877585 1.33090031
10 das -0.9001577 -0.02004211 1.45430076 0.93866318
...
and would like to use wordVectors operations. I see that there is a as.VectorSpaceModel(matrix) coercion function, but I don't know the form of the matrix that is required for the coercion to work.
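
A plausible shape for that matrix is a numeric matrix with one row per word and the words as row names. The sketch below rebuilds the first three rows of the table above and converts them. Treat the required shape as an assumption rather than documented behaviour. The coercion call is attempted only if the package is installed.

```r
# First three rows of the data frame shown in the issue: V1 holds the words.
df <- data.frame(
  V1 = c("der", "die", "und"),
  V2 = c(-0.1292338, -0.7408874, 0.1368700),
  V3 = c(1.41541564, 1.23070979, 0.21688898),
  V4 = c(0.72683984, 1.60728443, 0.09194378),
  V5 = c(-0.08601953, 0.21427894, -0.42764056),
  stringsAsFactors = FALSE
)

# Numeric columns become the matrix; the word column becomes the row names.
m <- as.matrix(df[, -1])
rownames(m) <- df$V1
colnames(m) <- NULL

# Coerce only when wordVectors is available (assumed to accept this shape).
if (requireNamespace("wordVectors", quietly = TRUE)) {
  vsm <- wordVectors::as.VectorSpaceModel(m)
}
```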

is wordVectors on cran?

I get this error when installing:
package 'wordVectors' is not available (for R version 3.4.2)

Is it available for a different R version?

Installation fails

R version 3.4.0 (2017-04-21) -- "You Stupid Darkness"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

devtools::install_github("bmschmidt/wordVectors")
Downloading GitHub repo bmschmidt/wordVectors@master
from URL https://api.github.com/repos/bmschmidt/wordVectors/zipball/master
Error: running command '"C:/Program Files/R/R-3.4.0/bin/x64/R" --no-site-file --no-environ --no-save --no-restore --quiet CMD config CC' had status 2

could not find the cookbook data

The demo data is not available, at least from China. I hope a bundled data source can be used for the demo instead.

prep_word2vec requires R >= 3.2.0

Hi, I am getting this error when trying to parse the text:

Beginning tokenization to text file at cookbooks.txt
Error in prep_word2vec("cookbooks", "cookbooks.txt", lowercase = T) :
  could not find function "dir.exists"
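
Since `dir.exists()` only exists from R 3.2.0 onward, code that must run on older versions can fall back to `file.info()`. The helper below is a hypothetical workaround sketch (`dir_exists_compat` is not part of the package); upgrading R is the simpler fix.

```r
# Use dir.exists() when base R provides it, otherwise emulate it.
dir_exists_compat <- function(paths) {
  if (exists("dir.exists", envir = baseenv(), inherits = FALSE)) {
    return(dir.exists(paths))
  }
  info <- file.info(paths)           # isdir is NA for paths that do not exist
  !is.na(info$isdir) & info$isdir
}

has_tmp <- dir_exists_compat(tempdir())
has_missing <- dir_exists_compat(file.path(tempdir(), "no-such-directory-xyz"))
```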

Test training on travis

I've removed the training tests from Travis because I can't figure out how to get them to work. Something seems to break when I try to write to tmp file.

They should be reactivated.

examples of reject

I've been having a lot of fun playing with this. I have a feature request relating to the 'reject' function.

I noticed that your examples in the "?reject" help page have us applying it to a full model, but README.md has a "bank" example where you apply it to the vector you are querying.

Is it possible to summarize the difference between these two ways of querying? The second way is much faster of course. They seem to produce results which are similar but not equivalent.

Also, what is the meaning of "/" in a formula argument to "closest_to"? Does it do actual division? I would have expected it to do something like the "reject" operation in the "bank" example.

I guess my request/suggestion is (1) expand the documentation for "?reject" to have both modes of usage and (2) add syntactic sugar for the "reject" operation.

Thank you.

Fatal Error on 1 MB file

Congratulations, the package is great and thanks for developing it.

I've tested it with some standard datasets and it works great (including the 50 MB cookbooks). However, when using it with a personal 1 MB dataset written in Brazilian Portuguese, R crashes every single time. I've already removed punctuation and excess white space, tried 1/2/4/8 threads and 100/200/500 vectors, with and without removing stopwords, but got no better result. Do you have any idea what the reason for this crash could be?

trouble using wordVectors in Rscript.

Pasted from an e-mail I received for tracking.

I’m having trouble using wordVectors in Rscript.
A minimal case:

library(magrittr)
library(wordVectors)

model <- read.vectors('foo.bin')
model %>% closest_to('man') %>% print()

This code works fine in an interactive R session, but it fails when run via Rscript:


Error in context[[formula]] : subscript out of bounds
Calls: %>% ... <Anonymous> -> closest_to -> cosineSimilarity -> sub_out_formula
Execution halted

(I get the same failure if I rewrite it to use traditional(function(nesting(syntax))) instead of magrittr, BTW.)

Code like

cosineSimilarity(model[['man']],model[['woman']]) %>% print()

also fails similarly in Rscript but works when stepped through or source()’d in an interactive R session.

Back in 2015 I used wordVectors extensively in Rscript with no problems, so whatever’s going on here seems to be connected to changes since then.

Sentence Window/context truncation

Great package! Quick question: It seems that the train_word2vec function only takes a single txt file as input. The original word2vec code can take inputs in paragraph/sentence format (which I guess would be a list of lists in R) and automatically truncates the window so there is no overlap in contexts across sentences. Is there a way to do that with wordVectors? I think not but wanted to ask.

Extract Network Weights

Hey---great work here. Sorry if this is obtuse, but is there a way to extract the neural network weight matrices after training?

Thanks!

Working with date-time format, can't handle POSIXct. (Error in as.POSIXlt.numeric(x) : 'origin' must be supplied)

Hello,

I have a date-time column in my database in a format such as "2017-01-02 8:27". I want to add 10 minutes to this date-time value.

dat$EventTime=as.POSIXct(strptime( dat$EventTime, "%Y-%m-%d %H:%M"), tz = "", origin = '1970-01-01 00:00')

##date-time format becomes 2017-01-02 08:27:00 which is ok, however when I try to add 10 minutes

dat$EventTime[1]+minute(10)

I get this error:

Error in as.POSIXlt.numeric(x) : 'origin' must be supplied

Could you please help me with that issue?

Thanks,
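
A likely cause is that lubridate's `minute()` extracts the minute component of a date-time, whereas `minutes()` constructs a period; calling `minute(10)` therefore tries to turn the number 10 into a date-time. In base R the addition needs no extra package, because POSIXct arithmetic is in seconds. A minimal sketch, with the time zone fixed to UTC purely for reproducibility:

```r
# Parse the timestamp format from the question, then add ten minutes in seconds.
event_time <- as.POSIXct("2017-01-02 8:27", format = "%Y-%m-%d %H:%M", tz = "UTC")
later <- event_time + 10 * 60

format(later, "%Y-%m-%d %H:%M")
```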

subscript out of bounds

Hi Ben,

I'm playing with wordvectors again, trying to replicate your genderless post on Melodee Beals' colonial newspaper database. Everything else is working fine, but when I do this:

genderless_cnd = cnd %>% reject(cnd[["he"]] - cnd[["she"]]) %>% reject(cnd[["man"]] - cnd[["woman"]])
#gendered CND:
cnd %>% nearest_to(cnd[["she"]],20) %>% names
#genderless CND:
genderless_cnd %>% nearest_to(genderless_cnd[["she"]],20) %>% names

I get the error:

Error in genderless_cnd[["she"]] : subscript out of bounds

I've tried paging through the bug tracking by options(error=recover) but I'm afraid it's beyond me. Wondered what you thought - have I messed something up?

Training on bigrams?

Hey! Wonderful package. Is there currently support for training the word2vec model on bigrams as well as unigrams? I've had some nice success using bigrams via gensim, and was wondering if I'm missing the way to include that here.

Thanks! Again, great work.

Use binary format by default

I just added code to read the binary format. It's about a third the size and takes about a third less time to read in, so I see no reason not to ultimately use it as the standard data interchange format instead of the text representation. Worth making sure that unicode token labels are making it through the gauntlet first, though.

Interestingly, the text version gzips down potentially a little smaller than the binary ones. If space is the only thing that matters, maybe we'd want to look at reading gzips. But faster read/write times are important, too.

problem to call library

When I use: library(wordVectors-2.0)
it says:
Error in library(wordVectors-2.0) :
there is no package called ‘wordVectors-2.0’

The same happens for wordVector-master or any other branch of your project on GitHub.

I'm new to RStudio. I then searched the web and, following some advice, manually deleted the folder and tried:

install.packages("D:/Folder/wordVectors-2.0.zip")
Installing package into ‘C:/Users/Admin/Documents/R/win-library/3.3’
(as ‘lib’ is unspecified)
Warning in install.packages :
package ‘D:/Folder/wordVectors-2.0.zip’ is not available (for R version 3.3.2)
And this time it did not create any folder, so it seems to be a compatibility problem, because of "package ‘D:/FolderL/wordVectors-2.0.zip’ is not available (for R version 3.3.2)"?

Fallback to read.binary.vectors if read.vectors fails

One problem with the switch to writing in the binary format is that users who upgrade (which is basically forced with every R version bump) may have code that gets broken if they use the suffix '.vectors'. (They can still read in an old model; but now if they train a new one with the old suffix, it chokes on read.)

One easy way to solve this would be to try-catch the read.vectors function: if it fails to read something as text, we could just give a shot at reading it in binary format. If the results are plausible (how to test? Simply by length, probably; or else valid unicode-ness of the row names), return them.
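
The try-catch idea could be sketched as follows. To keep the example runnable without the package, the two readers are passed in as arguments and replaced by stand-in functions. `read_with_fallback` and the plausibility check are hypothetical, not existing package code.

```r
# Try the primary reader; if it errors or returns something implausible,
# fall back to the secondary reader.
read_with_fallback <- function(path, primary, fallback,
                               is_plausible = function(x) length(x) > 0) {
  result <- tryCatch(primary(path), error = function(e) NULL)
  if (!is.null(result) && is_plausible(result)) {
    return(result)
  }
  result <- fallback(path)
  if (!is_plausible(result)) {
    stop("Could not read a plausible model from ", path)
  }
  result
}

# Stand-in readers: the text reader fails, the binary reader succeeds.
text_reader <- function(path) stop("not a text file")
binary_reader <- function(path) {
  matrix(1:4, nrow = 2, dimnames = list(c("a", "b"), NULL))
}
model <- read_with_fallback("model.vectors", text_reader, binary_reader)
```

In the package, `primary` and `fallback` would presumably be the text and binary readers named in this issue.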

in train_word2vec change output="test.vectors" to output = "test.bin"

I had old code and realized I needed to change the output to .bin.

I can't delete this comment now that I've figured it out. Sorry for the spam, but maybe this will happen to someone else.

Can we generate .vector files

Hello Ben,

Since the "*.bin" output vector files are not human-readable, is there any way to generate a ".Vectors" file, as was possible in earlier releases of word2vec?

I am asking because ".vector" files are easier to convert to TensorFlow's ".bytes" format for visualising them in the TensorFlow projector. The ".vector" file format matched Mikolov's original format for capturing vectors.

I want to be able to generate output vector files which are readable.

Thanks for the great work in developing this package.

Regards

n-grams greater than 2

I was looking to use trigrams because there are significant three-word phrases in my corpus (e.g. "economies in transition" to refer to developing countries). I used the following code in R.

library(wordVectors)
library(parallel)  # provides detectCores()

statements <- prep_word2vec(basePath,
                            "docs.txt",
                            lowercase = TRUE, bundle_ngrams = 3, threshold = 50)

w2v <- train_word2vec("docs.txt",
                      output = "./stat_vecs.bin",
                      threads = detectCores(),
                      vectors = 100,
                      window = 7,
                      force = TRUE)

It worked as expected with the exception that I got some four word phrases (e.g. "so_that_they_can"). I'm curious why this is happening. Thanks!

Add function to 'improve' models

This article spells out a pretty simple pair of tricks that supposedly makes pre-trained embeddings perform better on most benchmarks. I've implemented it in R, but don't have the benchmarks locally to be sure it's useful. I will try to bundle it up into this package at some point.

Error on loading large model

I am trying to load the standard google news model, but I received this error:

Error in validObject(.Object) : invalid class “VectorSpaceModel” object: Error : cannot allocate vector of size 6.7 Gb

I tried on my macbook and my linux workstation, neither worked. I have no trouble loading this model in gensim.

Also, these are my R stats:

platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 3.1
year 2016
month 06
day 21
svn rev 70800
language R
version.string R version 3.3.1 (2016-06-21)
nickname Bug in Your Hair

Cache magnitudes to improve repeated cosine similarity queries performance

After reading a binary file of 320895 rows and 100 columns, it takes around 15-20 seconds to get the result of a "nearest_to" call. Is this normal behaviour or rather some quirk? The gensim word2vec implementation is quick on exactly the same model (it takes less than 1s to get similar vectors).

The code is the following:
model <- read.vectors("model.bin")
nearest_to(model,model[["word"]])
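
The caching idea in the title can be illustrated in base R: compute the row norms once, then reuse them for every query so that each lookup costs a single matrix-vector product. This is a sketch on random data, not the package's implementation; `nearest_cached` and `row_norms` are hypothetical names.

```r
# Stand-in model: 1000 words with 20 dimensions each.
set.seed(1)
m <- matrix(rnorm(1000 * 20), nrow = 1000,
            dimnames = list(paste0("w", 1:1000), NULL))

# Cache the magnitude of every row once, up front.
row_norms <- sqrt(rowSums(m^2))

# Cosine similarity of every row to `v`, reusing the cached norms.
nearest_cached <- function(m, row_norms, v, n = 10) {
  v <- as.numeric(v)
  sims <- as.vector(m %*% v) / (row_norms * sqrt(sum(v^2)))
  names(sims) <- rownames(m)
  head(sort(sims, decreasing = TRUE), n)
}

top <- nearest_cached(m, row_norms, m["w1", ], n = 5)
```

Sorting the full similarity vector is still linear-logarithmic in the vocabulary size; a partial sort would reduce that further.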
