saraswatmks / superml
Build machine learning models in R like using python's scikit-learn library
Home Page: https://saraswatmks.github.io/superml/
License: GNU General Public License v3.0
In the next iteration, the count vectorizer should have the following features:
Change the max_feature parameter to a count rather than a percentage
Check for stopwords and remove them if needed
Clean tokens (remove punctuations, lemmatize etc.)
Add tokenize method = 'word' or 'char' (to split at word level)
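As a rough illustration of the requested features, here is a minimal base R sketch. The helpers tokenize() and top_terms() are hypothetical names invented for this example and are not part of superml; the sketch only shows punctuation cleaning, a 'word' or 'char' tokenize method, and max_features treated as a count.

```r
# Hypothetical helpers for illustration only; not superml's implementation.

# Split one text into tokens at word or character level,
# after lowercasing and replacing punctuation with spaces.
tokenize <- function(text, method = c("word", "char")) {
  method <- match.arg(method)
  text <- tolower(gsub("[[:punct:]]+", " ", text))
  if (method == "word") {
    tokens <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  } else {
    tokens <- strsplit(gsub("[[:space:]]+", "", text), "")[[1]]
  }
  tokens[nzchar(tokens)]
}

# Keep the max_features most frequent word tokens (a count, not a percentage).
top_terms <- function(texts, max_features) {
  counts <- sort(table(unlist(lapply(texts, tokenize))), decreasing = TRUE)
  names(head(counts, max_features))
}

word_tokens <- tokenize("Hi, you!", method = "word")
char_tokens <- tokenize("a b", method = "char")
vocab <- top_terms(c("home, home and work", "work at home"), max_features = 2)
```

Stopword removal and lemmatization are left out here; they would slot in as extra filtering steps on the token vector.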
Hi Administrator,
I was using the superml package. I have trained the TfIdfVectorizer and was able to transform it on the training set with no problem; however, when I try to transform the testing set, I end up getting the same TfIdf matrix as the training set.
In addition to this error, I ran into a bug trying to install the latest version on Windows. When I try to install the package, I get the error below. However, it installs fine on a Mac.
C:/Rtools/mingw_64/x86_64-w64-mingw32/include/c++/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support is currently experimental, and must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
#error This file requires compiler and library support for the
^
utils.cpp: In function 'Rcpp::CharacterVector superSplit(std::string, char)':
utils.cpp:15:25: error: 'move' is not a member of 'std'
elems.push_back(std::move(item));
^
utils.cpp: In function 'std::vector<std::basic_string<char> > superNgrams(std::string, Rcpp::NumericVector, char)':
utils.cpp:45:68: error: '>>' should be '> >' within a nested template argument list
std::vector<std::string> rx = as<std::vector<std::string>>(r);
^
utils.cpp: In function 'std::vector<std::basic_string<char> > superTokenizer(std::vector<std::basic_string<char> >)':
utils.cpp:63:9: warning: 'auto' changes meaning in C++11; please remove it [-Wc++0x-compat]
for(auto i: string){
^
utils.cpp:63:14: error: 'i' does not name a type
for(auto i: string){
^
utils.cpp:72:5: error: expected ';' before 'return'
return output;
^
utils.cpp:72:5: error: expected primary-expression before 'return'
utils.cpp:72:5: error: expected ';' before 'return'
utils.cpp:72:5: error: expected primary-expression before 'return'
utils.cpp:72:5: error: expected ')' before 'return'
utils.cpp: In function 'Rcpp::NumericMatrix superCountMatrix(std::vector<std::basic_string<char> >, std::vector<std::basic_string<char> >)':
utils.cpp:82:20: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
for(int i=0; i < sent.size(); i++){
^
utils.cpp:85:24: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
for(int j=0; j < tokens.size(); j++){
^
utils.cpp:86:13: error: 'regex' was not declared in this scope
regex e = std::regex("\b" + tokens[j] + "\b");
^
utils.cpp:87:42: error: 'e' was not declared in this scope
string m = regex_replace (s, e, "");
^
utils.cpp:87:47: error: 'regex_replace' was not declared in this scope
string m = regex_replace (s, e, "");
^
utils.cpp: In function 'std::vector<std::basic_string<char> > superTokenizer(std::vector<std::basic_string<char> >)':
utils.cpp:74:1: warning: control reaches end of non-void function [-Wreturn-type]
}
^
make: *** [C:/PROGRA1/MICROS3/ROPEN1/R-351.3/etc/x64/Makeconf:215: utils.o] Error 1
ERROR: compilation failed for package 'superml'
Can you please help? Thanks.
Best,
jonyclee
I have the following example
# should be a vector of texts
sents <- c('i, am, going, home, and, home',
'where, are, you , going.? //// ',
'how, does, it, work')
cfv <- CountVectorizer$new(max_features = 10, remove_stopwords = FALSE, split = ", ")
# generate the matrix
cf_mat <- cfv$fit_transform(sents)
head(cf_mat, 3)
As you can see after executing it, it doesn't split on the comma sign, but splits on space again.
Is this a bug? Would a Pull Request be welcome?
Thanks in advance!
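For reference, this is roughly what splitting on the comma would be expected to produce. It is a minimal base R sketch using strsplit(), not superml's tokenizer, so it only illustrates the expected tokens and says nothing about how CountVectorizer handles split internally.

```r
# Expected tokens when splitting the first sentence on commas (base R only).
sent <- 'i, am, going, home, and, home'

# fixed = TRUE treats "," as a literal separator rather than a regex.
pieces <- strsplit(sent, ",", fixed = TRUE)[[1]]

# Drop the whitespace left around each piece.
tokens <- trimws(pieces)

# Token counts, comparable to one row of a count matrix.
counts <- table(tokens)
```

If cf_mat has columns that differ from names(counts), that would support the report that the split argument is being ignored.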
Hello there,
I am trying to run GridSearchCV, but I am getting the following error:
Error in if (cuts < 2) cuts <- 2 : missing value where TRUE/FALSE needed
the commands that I am trying to run are the following:
rf <- RFTrainer$new()
gridSearch <- GridSearchCV$new(rf,
parameters = list(n_estimators = c(10, 20, 30, 50, 70, 80, 100, 120, 150, 180, 200, 220, 280, 320),
max_features = c("auto", "sqrt", "log2"),
oob_score=c(TRUE, FALSE), bootstrap=c(TRUE, FALSE),
class_weight = c("None", "balanced"), criterion=c("gini", "entropy")))
gridSearch$fit(train, "data.cl")
Note: I tried that on the Iris dataset, and it also returns the same error.
So what do you suggest?
Thanks in advance.
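The message itself is R's standard error when an if() condition evaluates to NA, which suggests that cuts was NA when that check ran. Why it would be NA inside superml is not established in this thread. The sketch below only reproduces that class of error in base R; safe_cuts() is a hypothetical guard written for this example, not superml code.

```r
# Reproduce the class of error: if() cannot branch on an NA condition.
msg <- tryCatch(
  {
    cuts <- NA_integer_
    if (cuts < 2) cuts <- 2
    "no error"
  },
  error = function(e) conditionMessage(e)
)

# Hypothetical guard: fall back to 2 when the value is missing or too small.
safe_cuts <- function(cuts) {
  if (is.na(cuts) || cuts < 2) 2 else cuts
}
```

A guard like this hides the symptom only; the underlying question is which input makes the value NA in the first place.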
I can't install the package from CRAN and had to install it with devtools. Is there a plan to put it back on CRAN?
Hi,
Thanks for using data.table. Since superml uses data.table, I check it in revdep testing of data.table. superml is currently in error status on CRAN, and for me locally, due to liquidSVM being removed from CRAN. No rush: just making sure you're aware.
Also, the word "needed" can be removed in this line:
R/super_utils.R: stop(paste0("Need Package " , pkg, "needed for this function to work. Please install it."),
$ more 00check.log
* using log directory ‘/home/mdowle/build/revdeplib/superml.Rcheck’
* using R version 3.6.0 (2019-04-26)
* using platform: x86_64-pc-linux-gnu (64-bit)
* using session charset: UTF-8
* checking for file ‘superml/DESCRIPTION’ ... OK
* checking extension type ... Package
* this is package ‘superml’ version ‘0.3.0’
* package encoding: UTF-8
* checking package namespace information ... OK
* checking package dependencies ... NOTE
Package suggested but not available for checking: ‘liquidSVM’
* checking if this is a source package ... OK
* checking if there is a namespace ... OK
* checking for executable files ... OK
* checking for hidden files and directories ... OK
* checking for portable file names ... OK
* checking for sufficient/correct file permissions ... OK
* checking whether package ‘superml’ can be installed ... OK
* checking installed package size ... OK
* checking package directory ... OK
* checking ‘build’ directory ... OK
* checking DESCRIPTION meta-information ... OK
* checking top-level files ... OK
* checking for left-over files ... OK
* checking index information ... OK
* checking package subdirectories ... OK
* checking R files for non-ASCII characters ... OK
* checking R files for syntax errors ... OK
* checking whether the package can be loaded ... OK
* checking whether the package can be loaded with stated dependencies ... OK
* checking whether the package can be unloaded cleanly ... OK
* checking whether the namespace can be loaded with stated dependencies ... OK
* checking whether the namespace can be unloaded cleanly ... OK
* checking loading without being on the library search path ... OK
* checking dependencies in R code ... OK
* checking S3 generic/method consistency ... OK
* checking replacement functions ... OK
* checking foreign function calls ... OK
* checking R code for possible problems ... OK
* checking Rd files ... OK
* checking Rd metadata ... OK
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... OK
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
* checking contents of ‘data’ directory ... OK
* checking data for non-ASCII characters ... OK
* checking data for ASCII and uncompressed saves ... OK
* checking installed files from ‘inst/doc’ ... OK
* checking files in ‘vignettes’ ... OK
* checking examples ... ERROR
Running examples in ‘superml-Ex.R’ failed
The error most likely occurred in:
> ### Name: SVMTrainer
> ### Title: Support Vector Machines Trainer
> ### Aliases: SVMTrainer
> ### Keywords: datasets
>
> ### ** Examples
>
> data(iris)
> ## Multiclassification
> svm <- SVMTrainer$new(type="mc")
Error: Need Package liquidSVMneeded for this function to work. Please install it.
Execution halted
* checking for unstated dependencies in ‘tests’ ... OK
* checking tests ... OK
Running ‘test_miscs.R’
Running ‘testthat.R’
* checking for unstated dependencies in vignettes ... OK
* checking package vignettes in ‘inst/doc’ ... OK
* checking running R code from vignettes ... NONE
* checking re-building of vignette outputs ... WARNING
Error(s) in re-building vignettes:
...
--- re-building ‘introduction.Rmd’ using rmarkdown
Quitting from lines 110-115 (introduction.Rmd)
Error: processing vignette 'introduction.Rmd' failed with diagnostics:
Need Package liquidSVMneeded for this function to work. Please install it.
--- failed re-building ‘introduction.Rmd’
SUMMARY: processing the following file failed:
‘introduction.Rmd’
Error: Vignette re-building failed.
Execution halted
* checking PDF version of manual ... OK
* DONE
Status: 1 ERROR, 1 WARNING, 1 NOTE
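The run-together "liquidSVMneeded" in the log above follows directly from how paste0() joins its arguments without a separator. A small base R sketch of the current message and one possible rewording (the new wording is a suggestion, not the package's actual text):

```r
pkg <- "liquidSVM"

# Message as built in R/super_utils.R: no space after pkg, and "needed" is redundant.
original <- paste0("Need Package ", pkg,
                   "needed for this function to work. Please install it.")

# One possible rewording with the package name quoted and spaced.
fixed <- paste0("Package '", pkg,
                "' is needed for this function to work. Please install it.")
```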
A question about quantile regression with SVM: I know this package can do support vector quantile regression by setting the parameter type to qt, but I ran into problems and don't know how to get it working. Could you give me an example of SVQR in R?
Running this example from the tutorial, as shown below, it looks like TfIdfVectorizer$transform() doesn't do anything with the supplied arguments, since train_tf_features and test_tf_features are identical. Either that, or this isn't a great example, since there's no difference in the output.
Here's the example from the tutorial along with my additional code showing that they are identical.
library(data.table)
library(superml)
# use sents from above
sents <- c('i am going home and home',
'where are you going.? //// ',
'how does it work',
'transform your work and go work again',
'home is where you go from to work',
'how does it work')
# create dummy data
train <- data.table(text = sents, target = rep(c(0,1), 3))
test <- data.table(text = sample(sents), target = rep(c(0,1), 3))
# initialise the class, set parallel to TRUE for fast computation
tfv <- TfIdfVectorizer$new(min_df = 0.3, remove_stopwords = FALSE, ngram_range = c(1,3), parallel = FALSE)
# we fit on train data
tfv$fit(train$text)
train_tf_features <- tfv$transform(train$text)
test_tf_features <- tfv$transform(test$text)
#WTF, they're the same
identical(train_tf_features, test_tf_features)
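Since test is built with sample(sents), it holds the same documents in a shuffled order, so a transform that respects its input should return the training rows in that shuffled order rather than an identical matrix. The sketch below shows the check with a toy base R stand-in; toy_transform() and the fixed permutation perm are assumptions made for this example and do not reflect how superml computes its features.

```r
# Toy stand-in for a fitted vectorizer's transform(); NOT superml code.
# Counts how often each vocabulary term occurs in each text.
toy_transform <- function(texts, vocab) {
  tokens <- strsplit(tolower(texts), "[^a-z]+")
  mat <- t(vapply(
    tokens,
    function(tok) as.numeric(table(factor(tok, levels = vocab))),
    numeric(length(vocab))
  ))
  colnames(mat) <- vocab
  mat
}

train_text <- c("i am going home", "how does it work", "go work again")
perm <- c(3, 1, 2)               # fixed shuffle instead of sample()
test_text <- train_text[perm]
vocab <- c("going", "home", "work")

train_mat <- toy_transform(train_text, vocab)
test_mat <- toy_transform(test_text, vocab)

# A transform that uses its input gives different matrices here ...
same_as_train <- identical(train_mat, test_mat)
# ... that match once the training rows are reordered the same way.
row_permuted <- identical(train_mat[perm, , drop = FALSE], test_mat)
```

Applying the same comparison to tfv$transform() output, with the shuffle order saved, would distinguish "transform ignores its argument" from "sample() happened to return the original order".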
Hi. First of all, this is a great idea. I build ML models for text analysis in both R and Python, and it has been a headache to translate Python code into R. This package is a great help. I looked up the CountVectorizer() function and was wondering whether you plan to add an ngram_range argument to it. Thanks.
Installing the package fails with the following messages:
Warning in install.packages :
cannot open URL 'https://cran.rstudio.com/bin/macosx/contrib/4.2/superml_0.5.5.tgz': HTTP status was '404 Not Found'
Error in download.file(url, destfile, method, mode = "wb", ...) :
cannot open URL 'https://cran.rstudio.com/bin/macosx/contrib/4.2/superml_0.5.5.tgz'
Warning in install.packages :
download of package ‘superml’ failed
svm = SVMTrainer$new()
Error: object 'SVMTrainer' not found
packageVersion('superml')
[1] ‘0.5.3’
I tried to use this package with LSSVM but could not get it to work. Could you please help?
Can I add the lssvm file to this project from :
add lssvm to this part of the code
available_trainers = c("XGBTrainer",
"RFTrainer",
"NBTrainer"),
I would like not to set max_features in the TfIdfVectorizer function. In other words, I want to create a tdm with all word frequencies. However, if I do not set max_features, the function takes a very long time; I terminated it after 15 hours. The same corpus takes about 5 minutes in the tm or quanteda packages. So how can I create a tdm with max_features set to "none"?
tfv <- TfIdfVectorizer$new(
min_df=1,
max_df=1,
# max_features = 256,
ngram_range= c(1, 1),
remove_stopwords = F,
lowercase = T,
smooth_idf = T,
norm = T
# not defined in Python TfIdfVectorizer
#split = "
#regex
)
tf_mat <- tfv$fit_transform(imf_corpus)
y parameter of all ml models should be a vector
implement cv folds function
Fix default scoring in random/grid search for binary classification
I was running an example from the tutorial ( https://saraswatmks.github.io/superml/articles/Guide-to-TfidfVectorizer.html )
and it automatically tried to install xgboost.
It looks like superml::check_package() tries to install a package if it's not found. That's risky and may cause other problems. If a package is required, it should be listed as a required dependency so that it is installed when superml is installed from CRAN.
I also came across an issue with a function not working because it needed the tm package.
Here's the log from my console:
> xgb <- XGBTrainer$new(n_estimators = 10, objective = "binary:logistic")
Installing package into ‘C:/Users/roe13/Documents/R/win-library/3.6’
(as ‘lib’ is unspecified)
Warning: unable to access index for repository http://cran.us.r-project.org/src/contrib:
cannot open URL 'http://cran.us.r-project.org/src/contrib/PACKAGES'
Warning: unable to access index for repository http://cran.us.r-project.org/bin/windows/contrib/3.6:
cannot open URL 'http://cran.us.r-project.org/bin/windows/contrib/3.6/PACKAGES'
Finished installing.
Warning messages:
1: In superml::check_package("xgboost") :
Require Package xgboostfor this function to
work. Installing it.
2: package ‘xgboost’ is not available (for R version 3.6.1)
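One common alternative to installing on the fly is to check with requireNamespace() and stop with a clear message, leaving the installation to the user. The sketch below assumes this behaviour is what the issue is asking for; require_package() is a hypothetical name and not the package's actual check_package().

```r
# Hypothetical replacement for an auto-installing check; not superml code.
# Fails with an informative error instead of calling install.packages().
require_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop(
      sprintf(
        "Package '%s' is required for this function. Install it with install.packages(\"%s\").",
        pkg, pkg
      ),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# 'stats' ships with R, so this check passes.
ok <- require_package("stats")

# A package that is not installed produces an error and no side effects.
err <- tryCatch(
  require_package("notARealPackage12345"),
  error = function(e) conditionMessage(e)
)
```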
Hi @saraswatmks , thank you for the smooth package.
I noticed that SVMTrainer is no longer in the R folder. Did you remove it for a particular reason?
Thank you!
Hello,
I tried to use BM_25 (R version) inside SQL Server 2022 and noticed a bug.
If I use BM_25 in SQL Server, I get only one column as output (score), without the corresponding docs.
ALTER PROC reports.PROC_2
as
DECLARE @Rscript NVARCHAR(MAX) = N'
library(superml)
docs <- c("Kaufmann", "test")
sentence <- "Kaufmann"
s <- bm_25(document = sentence, corpus = docs, top_n=10)
OutputDataSet <- as.data.frame(s)
OutputDataSet
' ;
EXEC sp_execute_external_script @language = N'R',
@script = @Rscript
GO
EXEC reports.PROC_2 'Kaufmann'
(No column name)
0,693147180559945
0
If I execute the same code in VS Code, I get two columns (docs, score) as output. Very weird. Does anyone understand why this happens?
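One possible explanation, assuming bm_25() returns a named numeric vector (not confirmed here): as.data.frame() on such a vector yields a single column and keeps the document names only as row names. An R console prints row names, which looks like a second column, whereas a data frame handed back to SQL Server carries only its columns. The sketch below uses a hand-written stand-in for the bm_25() result.

```r
# Stand-in for the bm_25() result; the named-vector shape is an assumption.
s <- c(Kaufmann = 0.693147180559945, test = 0)

# One column; the document names survive only as row names.
one_col <- as.data.frame(s)

# Making the names an explicit column keeps them in the output.
two_col <- data.frame(
  doc = names(s),
  score = unname(s),
  stringsAsFactors = FALSE
)
```

Assigning two_col to OutputDataSet should then return both columns; adding WITH RESULT SETS to the sp_execute_external_script call can give them column names.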