saraswatmks / superml

Build machine learning models in R like using python's scikit-learn library

Home Page: https://saraswatmks.github.io/superml/

License: GNU General Public License v3.0

Languages: R 85.38%, C++ 14.62%
Topics: r, r-package, rstats

superml's Issues

Count Vectorizer doesn't clean query

In the next iteration, the count vectorizer should have the following features:

  • Change the max_features parameter to a count rather than a percentage

  • Check for stopwords and remove them if needed

  • Clean tokens (remove punctuation, lemmatize, etc.)

  • Add a tokenize method, 'word' or 'char' (to split at word or character level)
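
The requested features can be sketched in base R. This is only an illustration of the behaviour the list describes, not superml's implementation; the function name `simple_count_vectorizer` and its arguments are hypothetical, and lemmatization is left out.

```r
# Hypothetical helper, not part of superml: a base-R count vectorizer sketch.
simple_count_vectorizer <- function(texts, tokenize = c("word", "char"),
                                    stopwords = character(0),
                                    max_features = NULL) {
  tokenize <- match.arg(tokenize)
  # Clean: lowercase, replace punctuation with spaces, squeeze whitespace
  clean <- tolower(texts)
  clean <- gsub("[[:punct:]]+", " ", clean)
  clean <- trimws(gsub("\\s+", " ", clean))
  # Tokenize at word level or at character level
  tokens <- if (tokenize == "word") {
    strsplit(clean, " ", fixed = TRUE)
  } else {
    strsplit(gsub(" ", "", clean, fixed = TRUE), "", fixed = TRUE)
  }
  # Drop empty tokens and stopwords
  tokens <- lapply(tokens, function(tk) tk[nzchar(tk) & !(tk %in% stopwords)])
  # Rank the vocabulary by total frequency; max_features is a count, not a percentage
  freq <- sort(table(unlist(tokens)), decreasing = TRUE)
  vocab <- names(freq)
  if (!is.null(max_features)) vocab <- head(vocab, max_features)
  # Count every vocabulary term in every document
  counts <- lapply(tokens, function(tk) {
    as.integer(table(factor(tk, levels = vocab)))
  })
  matrix(unlist(counts), nrow = length(texts), byrow = TRUE,
         dimnames = list(NULL, vocab))
}

m <- simple_count_vectorizer(c("a b b", "b, c!"), max_features = 1)
print(m)
```

With `max_features = 1` only the most frequent token survives, so the result has a single column.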

TfIdfVectorizer Transform Problem

Hi Administrator,

I was using the superml package. I have trained the TfIdfVectorizer and was able to transform it on the training set with no problem; however, when I try to transform the testing set, I end up getting the same TfIdf matrix as the training set.

In addition to this error, I ran into a bug trying to install the latest version on Windows. When I try to install the package, I get the error below. However, it works perfectly fine when I install it on a Mac.

C:/Rtools/mingw_64/x86_64-w64-mingw32/include/c++/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support is currently experimental, and must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
#error This file requires compiler and library support for the
^
utils.cpp: In function 'Rcpp::CharacterVector superSplit(std::string, char)':
utils.cpp:15:25: error: 'move' is not a member of 'std'
elems.push_back(std::move(item));
^
utils.cpp: In function 'std::vector<std::basic_string > superNgrams(std::string, Rcpp::NumericVector, char)':
utils.cpp:45:68: error: '>>' should be '> >' within a nested template argument list
std::vectorstd::string rx = as<std::vector>(r);
^
utils.cpp: In function 'std::vector<std::basic_string > superTokenizer(std::vector<std::basic_string >)':
utils.cpp:63:9: warning: 'auto' changes meaning in C++11; please remove it [-Wc++0x-compat]
for(auto i: string){
^
utils.cpp:63:14: error: 'i' does not name a type
for(auto i: string){
^
utils.cpp:72:5: error: expected ';' before 'return'
return output;
^
utils.cpp:72:5: error: expected primary-expression before 'return'
utils.cpp:72:5: error: expected ';' before 'return'
utils.cpp:72:5: error: expected primary-expression before 'return'
utils.cpp:72:5: error: expected ')' before 'return'
utils.cpp: In function 'Rcpp::NumericMatrix superCountMatrix(std::vector<std::basic_string >, std::vector<std::basic_string >)':
utils.cpp:82:20: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
for(int i=0; i < sent.size(); i++){
^
utils.cpp:85:24: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
for(int j=0; j < tokens.size(); j++){
^
utils.cpp:86:13: error: 'regex' was not declared in this scope
regex e = std::regex("\b" + tokens[j] + "\b");
^
utils.cpp:87:42: error: 'e' was not declared in this scope
string m = regex_replace (s, e, "");
^
utils.cpp:87:47: error: 'regex_replace' was not declared in this scope
string m = regex_replace (s, e, "");
^
utils.cpp: In function 'std::vector<std::basic_string > superTokenizer(std::vector<std::basic_string >)':
utils.cpp:74:1: warning: control reaches end of non-void function [-Wreturn-type]
}
^
make: *** [C:/PROGRA1/MICROS3/ROPEN1/R-351.3/etc/x64/Makeconf:215: utils.o] Error 1
ERROR: compilation failed for package 'superml'

Can you please help? Thanks.

Best,
jonyclee

CountVectorizer split argument doesn't do anything

I have the following example

# should be a vector of texts
sents <-  c('i, am, going, home, and, home',
          'where, are, you , going.? //// ',
          'how, does, it, work')

cfv <- CountVectorizer$new(max_features = 10, remove_stopwords = FALSE, split = ", ")

# generate the matrix
cf_mat <- cfv$fit_transform(sents)

head(cf_mat, 3)

As you can see after executing it, it doesn't split on the comma; it still splits on spaces.

Is this a bug? Would a Pull Request be welcome?
Thanks in advance!
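
For comparison, here is a base-R sketch of the behaviour the reporter expects from a custom separator. It does not use superml; `split_tokens` is a hypothetical helper that splits on a literal string and trims leftover whitespace.

```r
sents <- c("i, am, going, home, and, home",
           "where, are, you , going.? //// ",
           "how, does, it, work")

# Hypothetical helper: split on a literal separator, then trim each token
split_tokens <- function(texts, split = " ") {
  lapply(strsplit(texts, split, fixed = TRUE), trimws)
}

tokens <- split_tokens(sents, split = ", ")
print(tokens[[1]])

# Token counts for the first sentence
first_counts <- table(tokens[[1]])
print(first_counts)
```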

GridSearchCV error

Hello there,
I am trying to run GridSearchCV, but I am getting the following error:
Error in if (cuts < 2) cuts <- 2 : missing value where TRUE/FALSE needed

The commands I am trying to run are the following:

rf <- RFTrainer$new()
gridSearch <- GridSearchCV$new(rf, 
	parameters = list(n_estimators = c(10, 20, 30, 50, 70, 80, 100, 120, 150, 180, 200, 220, 280, 320), 
	max_features = c("auto", "sqrt", "log2"), 
	oob_score=c(TRUE, FALSE), bootstrap=c(TRUE, FALSE), 
	class_weight = c("None", "balanced"), criterion=c("gini", "entropy")))
gridSearch$fit(train, "data.cl")

Note: I tried that on the Iris dataset, and it also returns the same error.

So what do you suggest?
Thanks in advance.
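
R raises this message whenever an `if` condition evaluates to NA. The sketch below only reproduces the message in isolation and shows a defensive guard. It does not identify where `cuts` becomes NA inside superml, and `guard_cuts` is a hypothetical name.

```r
# Reproduce the class of error: an NA condition inside if()
cuts <- NA_integer_
msg <- tryCatch({
  if (cuts < 2) cuts <- 2
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# Hypothetical guard: treat a missing value like a too-small value
guard_cuts <- function(cuts) {
  if (is.na(cuts) || cuts < 2) 2 else cuts
}

print(guard_cuts(NA))
print(guard_cuts(5))
```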

Removed from CRAN?

I can't install the package from CRAN and had to install it with devtools. Is there a plan to put it back on CRAN?

Error due to liquidSVM removed from CRAN

Hi,
Thanks for using data.table. Since superml uses data.table, I check it in revdep testing of data.table. superml is currently in error status on CRAN, and for me locally, due to liquidSVM being removed from CRAN. No rush; just making sure you're aware.
Also, the word "needed" can be removed from this line:
R/super_utils.R: stop(paste0("Need Package " , pkg, "needed for this function to work. Please install it."),
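
A small sketch of the wording problem: `paste0()` inserts no separator, so the package name runs into the following word. The replacement text in `new_msg` is only a suggestion.

```r
pkg <- "liquidSVM"

# Message as currently built: no space after the package name
old_msg <- paste0("Need Package ", pkg,
                  "needed for this function to work. Please install it.")

# Suggested wording: drop "needed" and add the missing space
new_msg <- paste0("Need Package ", pkg,
                  " for this function to work. Please install it.")

print(old_msg)
print(new_msg)
```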

$ more 00check.log 
* using log directory ‘/home/mdowle/build/revdeplib/superml.Rcheck’
* using R version 3.6.0 (2019-04-26)
* using platform: x86_64-pc-linux-gnu (64-bit)
* using session charset: UTF-8
* checking for file ‘superml/DESCRIPTION’ ... OK
* checking extension type ... Package
* this is package ‘superml’ version ‘0.3.0’
* package encoding: UTF-8
* checking package namespace information ... OK
* checking package dependencies ... NOTE
Package suggested but not available for checking: ‘liquidSVM’
* checking if this is a source package ... OK
* checking if there is a namespace ... OK
* checking for executable files ... OK
* checking for hidden files and directories ... OK
* checking for portable file names ... OK
* checking for sufficient/correct file permissions ... OK
* checking whether package ‘superml’ can be installed ... OK
* checking installed package size ... OK
* checking package directory ... OK
* checking ‘build’ directory ... OK
* checking DESCRIPTION meta-information ... OK
* checking top-level files ... OK
* checking for left-over files ... OK
* checking index information ... OK
* checking package subdirectories ... OK
* checking R files for non-ASCII characters ... OK
* checking R files for syntax errors ... OK
* checking whether the package can be loaded ... OK
* checking whether the package can be loaded with stated dependencies ... OK
* checking whether the package can be unloaded cleanly ... OK
* checking whether the namespace can be loaded with stated dependencies ... OK
* checking whether the namespace can be unloaded cleanly ... OK
* checking loading without being on the library search path ... OK
* checking dependencies in R code ... OK
* checking S3 generic/method consistency ... OK
* checking replacement functions ... OK
* checking foreign function calls ... OK
* checking R code for possible problems ... OK
* checking Rd files ... OK
* checking Rd metadata ... OK
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... OK
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
* checking contents of ‘data’ directory ... OK
* checking data for non-ASCII characters ... OK
* checking data for ASCII and uncompressed saves ... OK
* checking installed files from ‘inst/doc’ ... OK
* checking files in ‘vignettes’ ... OK
* checking examples ... ERROR
Running examples in ‘superml-Ex.R’ failed
The error most likely occurred in:

> ### Name: SVMTrainer
> ### Title: Support Vector Machines Trainer
> ### Aliases: SVMTrainer
> ### Keywords: datasets
> 
> ### ** Examples
> 
> data(iris)
> ## Multiclassification
> svm <- SVMTrainer$new(type="mc")
Error: Need Package liquidSVMneeded for this function to work. Please install it.
Execution halted
* checking for unstated dependencies in ‘tests’ ... OK
* checking tests ... OK
  Running ‘test_miscs.R’
  Running ‘testthat.R’
* checking for unstated dependencies in vignettes ... OK
* checking package vignettes in ‘inst/doc’ ... OK
* checking running R code from vignettes ... NONE
* checking re-building of vignette outputs ... WARNING
Error(s) in re-building vignettes:
  ...
--- re-building ‘introduction.Rmd’ using rmarkdown
Quitting from lines 110-115 (introduction.Rmd) 
Error: processing vignette 'introduction.Rmd' failed with diagnostics:
Need Package liquidSVMneeded for this function to work. Please install it.
--- failed re-building ‘introduction.Rmd’

SUMMARY: processing the following file failed:
  ‘introduction.Rmd’

Error: Vignette re-building failed.
Execution halted

* checking PDF version of manual ... OK
* DONE
Status: 1 ERROR, 1 WARNING, 1 NOTE

SVMTrainer

I have a question about how to use quantile regression with SVM. I know this package can perform support vector quantile regression by setting the type parameter to qt, but I ran into some problems and don't know how to get it working. Could you give me an example of SVQR in R?
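
As background, quantile regression minimises the pinball (quantile) loss rather than squared error. The base-R sketch below defines that loss and checks that the constant minimising it is close to the empirical quantile. It does not show SVMTrainer or liquidSVM usage; `pinball_loss` is a hypothetical helper name.

```r
# Pinball loss for quantile level tau (0 < tau < 1)
pinball_loss <- function(y, pred, tau) {
  err <- y - pred
  mean(ifelse(err >= 0, tau * err, (tau - 1) * err))
}

set.seed(1)
y <- rnorm(1000)

# Grid search for the constant prediction with the lowest loss at tau = 0.9
candidates <- seq(-3, 3, by = 0.01)
losses <- vapply(candidates, function(q) pinball_loss(y, q, tau = 0.9),
                 numeric(1))
best <- candidates[which.min(losses)]

print(best)
print(quantile(y, 0.9))
```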

transform doesn't do anything with supplied arguments?

Running this example from the tutorial, as shown below, it looks like TfIdfVectorizer$transform() doesn't do anything with the supplied arguments, since both train_tf_features and test_tf_features are identical. Either that, or this isn't a great example, since there's no difference in the output.

Here's the example from the tutorial along with my additional code showing that they are identical.

library(data.table)
library(superml)

# use sents from above
sents <-  c('i am going home and home',
            'where are you going.? //// ',
            'how does it work',
            'transform your work and go work again',
            'home is where you go from to work',
            'how does it work')

# create dummy data
train <- data.table(text = sents, target = rep(c(0,1), 3))
test <- data.table(text = sample(sents), target = rep(c(0,1), 3))

# initialise the class, set parallel to TRUE for fast computation
tfv <- TfIdfVectorizer$new(min_df = 0.3, remove_stopwords = FALSE, ngram_range = c(1,3), parallel = FALSE)

# we fit on train data
tfv$fit(train$text)

train_tf_features <- tfv$transform(train$text)
test_tf_features <- tfv$transform(test$text)

#WTF, they're the same
identical(train_tf_features, test_tf_features)
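
For reference, here is a base-R sketch of what fit/transform is normally expected to do: the vocabulary and IDF weights are learned from the training text, and each call to transform scores whatever documents are passed in. Since the test set here is a reordering of the training sentences, the expected result is the same rows in a different order. The helpers `fit_tfidf` and `transform_tfidf` are hypothetical and much simpler than superml's vectorizer (no n-grams, no normalisation).

```r
tokenize <- function(texts) strsplit(tolower(texts), "\\s+")

# Learn the vocabulary and smoothed IDF weights from the training texts
fit_tfidf <- function(texts) {
  tokens <- tokenize(texts)
  vocab <- sort(unique(unlist(tokens)))
  doc_freq <- vapply(vocab, function(term) {
    sum(vapply(tokens, function(tk) term %in% tk, logical(1)))
  }, integer(1))
  idf <- log((1 + length(texts)) / (1 + doc_freq)) + 1
  list(vocab = vocab, idf = idf)
}

# Score the documents passed in, using the fitted vocabulary and weights
transform_tfidf <- function(model, texts) {
  tokens <- tokenize(texts)
  counts <- lapply(tokens, function(tk) {
    as.numeric(table(factor(tk, levels = model$vocab)))
  })
  counts <- matrix(unlist(counts), nrow = length(texts), byrow = TRUE,
                   dimnames = list(NULL, model$vocab))
  sweep(counts, 2, model$idf, `*`)
}

sents <- c("i am going home and home",
           "where are you going.? //// ",
           "how does it work",
           "transform your work and go work again",
           "home is where you go from to work",
           "how does it work")

model <- fit_tfidf(sents)
train_features <- transform_tfidf(model, sents)
test_features <- transform_tfidf(model, rev(sents))  # reversed order

print(identical(train_features, test_features))
print(identical(train_features, test_features[6:1, ]))
```

Here `rev()` stands in for `sample()` so the reordering is deterministic.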

Plan to add ngram_range argument for CountVectorizer()?

Hi. First of all, this is a great idea. I build ML models for text analysis in both R and Python, and it has been a headache to translate Python code into R. This package is a great help. I looked up the CountVectorizer() function and was wondering whether you plan to add an ngram_range argument to it. Thanks.
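
A minimal sketch of what an ngram_range argument would produce for one tokenized sentence, following scikit-learn's convention of an inclusive (min_n, max_n) range. `make_ngrams` is a hypothetical helper, not a superml function.

```r
# Hypothetical helper: build all n-grams for n in ngram_range (inclusive)
make_ngrams <- function(tokens, ngram_range = c(1, 1)) {
  out <- character(0)
  for (n in seq(ngram_range[1], ngram_range[2])) {
    if (length(tokens) < n) next  # sentence too short for this n
    starts <- seq_len(length(tokens) - n + 1)
    grams <- vapply(starts, function(i) {
      paste(tokens[i:(i + n - 1)], collapse = " ")
    }, character(1))
    out <- c(out, grams)
  }
  out
}

grams <- make_ngrams(c("how", "does", "it", "work"), ngram_range = c(1, 2))
print(grams)
```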

Missing SVMTrainer()

svm = SVMTrainer$new()
Error: object 'SVMTrainer' not found

packageVersion('superml')
[1] ‘0.5.3’

max_features argument in TfIdfVectorizer

I would like to not set max_features in TfIdfVectorizer. In other words, I want to create a TDM with all word frequencies. However, if I do not set max_features, the function takes a very long time; I terminated it after 15 hours. The same corpus takes about 5 minutes with the tm or quanteda packages. How can I create a TDM with max_features set to "none"?


tfv <- TfIdfVectorizer$new(
  min_df=1,
  max_df=1,
  # max_features = 256,
  ngram_range= c(1, 1), 
  remove_stopwords = F, 
  lowercase = T,
  smooth_idf = T,
  norm = T
  # not defined in Python TfIdfVectorizer
  #split = " 
  #regex
  )

tf_mat <- tfv$fit_transform(imf_corpus)
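
For a full vocabulary, a sparse representation keeps memory manageable. The sketch below builds a sparse document-term count matrix with the Matrix package; it is an illustration of the general approach, not a diagnosis of why superml is slow here, and `sparse_dtm` is a hypothetical helper.

```r
library(Matrix)

# Hypothetical helper: sparse document-term counts over the full vocabulary
sparse_dtm <- function(texts) {
  tokens <- strsplit(tolower(texts), "\\s+")
  vocab <- sort(unique(unlist(tokens)))
  i <- rep(seq_along(tokens), lengths(tokens))  # document index per token
  j <- match(unlist(tokens), vocab)             # vocabulary index per token
  # Repeated (i, j) pairs are summed, which yields the counts
  sparseMatrix(i = i, j = j, x = rep(1, length(i)),
               dims = c(length(texts), length(vocab)),
               dimnames = list(NULL, vocab))
}

dtm <- sparse_dtm(c("home and home", "work"))
print(dtm)
```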

Allow y parameter to be a vector

  • The y parameter of all ML models should be a vector

  • Implement a CV folds function

  • Fix the default scoring in random/grid search for binary classification
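
One possible shape for the CV folds function, sketched in base R. The name `make_folds` and its arguments are hypothetical; it assigns each row to one of k folds at random and returns train/validation indices per fold.

```r
# Hypothetical helper: random k-fold split of n rows
make_folds <- function(n, k = 5, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # Balanced fold labels, shuffled
  fold_id <- sample(rep(seq_len(k), length.out = n))
  lapply(seq_len(k), function(f) {
    list(train = which(fold_id != f), valid = which(fold_id == f))
  })
}

folds <- make_folds(10, k = 3, seed = 42)
print(lengths(lapply(folds, `[[`, "valid")))
```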

List required packages as a dependency, don't try to install them automatically

I was running an example from the tutorial ( https://saraswatmks.github.io/superml/articles/Guide-to-TfidfVectorizer.html )
and it automatically tried to install xgboost.

It looks like superml::check_package() tries to install a package if it's not found. That's risky and may cause other problems. If a package is required, it should be listed as a required dependency so that it is installed when superml is installed from CRAN.

I also came across an issue with a function not working because it needed the tm package.

Here's the log from my console:

> xgb <- XGBTrainer$new(n_estimators = 10, objective = "binary:logistic")
Installing package into ‘C:/Users/roe13/Documents/R/win-library/3.6’
(as ‘lib’ is unspecified)
Warning: unable to access index for repository http://cran.us.r-project.org/src/contrib:
  cannot open URL 'http://cran.us.r-project.org/src/contrib/PACKAGES'
Warning: unable to access index for repository http://cran.us.r-project.org/bin/windows/contrib/3.6:
  cannot open URL 'http://cran.us.r-project.org/bin/windows/contrib/3.6/PACKAGES'
Finished installing.
Warning messages:
1: In superml::check_package("xgboost") :
  Require Package xgboostfor this function to
                       work. Installing it.
2: package ‘xgboost’ is not available (for R version 3.6.1) 
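
A sketch of the alternative the reporter suggests for optional packages: check with `requireNamespace()` and stop with a clear message instead of installing anything. `require_package` is a hypothetical name, not the existing superml::check_package().

```r
# Hypothetical replacement: fail loudly, never install automatically
require_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop(sprintf(
      "Package '%s' is required for this function. Install it with install.packages(\"%s\").",
      pkg, pkg
    ), call. = FALSE)
  }
  invisible(TRUE)
}

# 'stats' ships with R, so this succeeds silently
require_package("stats")

# A package that is not installed produces an error rather than an install attempt
res <- tryCatch(require_package("notARealPackage12345"),
                error = function(e) conditionMessage(e))
print(res)
```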

Bugs Tracker

  • Doesn't trim spaces in ngrams in CountVectorizer

cannot find SVMTrainer

Hi @saraswatmks, thank you for the smooth package.
I noticed that SVMTrainer is no longer in the R folder. Did you remove it for a particular reason?

Thank you!

Bug in SuperML BM_25 in SQL Server

Hello,
I tried to use BM_25 (R version) inside SQL Server 2022 and noticed a bug.

If I use BM_25 in SQL Server, I get only one column as output (score) but not the corresponding docs.

ALTER PROC reports.PROC_2                   
as 

 DECLARE @Rscript NVARCHAR(MAX) = N'

 library(superml)
 
 docs <- c("Kaufmann", "test")

 sentence <- "Kaufmann"

 s <- bm_25(document = sentence, corpus = docs, top_n=10)

 OutputDataSet <- as.data.frame(s)

 OutputDataSet
 
 ' ;

EXEC sp_execute_external_script                      @language     = N'R',
     @script       = @Rscript
GO

EXEC reports.PROC_2 'Kaufmann'  
(No column name)
0,693147180559945
0

If I execute the same code in VS Code, I get two columns (docs, score) as output. Very weird. Does anyone understand why this happens?
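
One possible explanation, assuming bm_25() returns a named numeric vector (an assumption, not confirmed by the source): `as.data.frame()` turns the names into row names rather than a column, the R console prints row names so it looks like two columns, and a result set passed back to SQL Server carries only the columns. The sketch below uses a stand-in vector in place of the real bm_25() output.

```r
# Assumption: stand-in for the value returned by bm_25()
s <- c(Kaufmann = 0.693147180559945, test = 0)

# One real column; the documents live in the row names
one_col <- as.data.frame(s)
print(ncol(one_col))
print(rownames(one_col))

# Make the documents an explicit column before assigning to OutputDataSet
two_col <- data.frame(docs = names(s), score = unname(s),
                      row.names = NULL, stringsAsFactors = FALSE)
print(two_col)
```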
