aksnzhy / xlearn

High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.

Home Page: https://xlearn-doc.readthedocs.io/en/latest/index.html

License: Apache License 2.0

CMake 0.81% C++ 82.71% Makefile 2.16% Shell 7.74% M4 0.43% C 0.55% Python 5.44% R 0.07% Batchfile 0.07% Dockerfile 0.02%

machine-learning statistics data-science data-analysis factorization-machines ffm fm

xlearn's Issues

The small test file output is not between 0 and 1?

I ran the commands from the install md file:
./xlearn_train ./small_train.txt -v ./small_test.txt -s 2
./xlearn_predict ./small_test.txt ./small_train.txt.model

But when I checked the output file small_test.txt.out, I found that the outputs are not in the range 0 to 1:
1 -1.64431
2 -0.561951
3 -0.789286
4 -0.540316
5 -1.27843
6 -0.996344
7 -1.17249
8 -1.60885
9 -0.73878
10 -1.19467
Is there any other param that I can add to solve the problem? I also ran the code in Python, but got the same output. I also ran libffm with the same data, and got outputs between 0 and 1:
1 0.197887
2 0.363382
3 0.364715
4 0.3575
5 0.276505
6 0.332071
7 0.471544
8 0.177905
9 0.323288
10 0.232038
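
The first listing looks like raw model scores rather than probabilities. Another report in this list calls `setSigmoid()` on the Python model before `predict`, which suggests the sigmoid step is opt-in. As a minimal sketch, assuming the values above really are raw scores, the logistic function maps them into (0, 1). The helper name `sigmoid` is only illustrative and is not part of xlearn.

```python
import math

def sigmoid(score):
    """Map a raw score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

# Raw scores copied from the first output listing above.
raw_scores = [-1.64431, -0.561951, -0.789286, -0.540316, -1.27843]

probabilities = [sigmoid(s) for s in raw_scores]
for score, prob in zip(raw_scores, probabilities):
    print(f"{score:>10.6f} -> {prob:.6f}")
```

The converted values need not match the libffm numbers, since the two tools may have trained different models on the same data.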

Please preserve copyright and license when distributing source code

It looks like some of the code in the base directory is copied from Google's code base.

For example:
https://github.com/aksnzhy/xlearn/blob/master/src/base/scoped_ptr.h vs
https://github.com/google/protobuf/blob/master/src/google/protobuf/stubs/scoped_ptr.h

Please preserve the copyright and do not change the license type.

Benchmarks

How can the benchmarks be reproduced? Which dataset was used, and with which parameters and rank?

make error

[ 1%] Building CXX object gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o
cc1plus: error: unrecognized command line option "-std=c++11"
make[2]: *** [gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o] Error 1
make[1]: *** [gtest/CMakeFiles/gtest.dir/all] Error 2
make: *** [all] Error 2

infinity/NaN values in prediction

I built 5 models using a subset of the full training datasets, and then predicted on the test dataset. As you can see, I got infinity/NaN values. What could be the reason?

I could provide all the data if needed.

Thanks,

CV in xlearn

Hi:

Do you know what the default CV is in xlearn? Or is there any way of realizing K-fold CV in xlearn? Thanks!

Best,
Fenglin
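
Independently of whatever xlearn itself provides, K-fold cross-validation can be arranged by hand by splitting the training file into K parts. The sketch below assumes one sample per line; `k_fold_lines` is a hypothetical helper, not an xlearn function.

```python
import random

def k_fold_lines(lines, k, seed=0):
    """Yield (train_lines, valid_lines) pairs for K-fold cross-validation."""
    if k < 2 or k > len(lines):
        raise ValueError("k must be between 2 and the number of lines")
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    # Fold i takes every k-th line starting at offset i.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [line for j, fold in enumerate(folds) if j != i for line in fold]
        yield train, valid

# Toy data standing in for the lines of a training file.
samples = [f"{i % 2} 1:{i}" for i in range(10)]
for fold_id, (train, valid) in enumerate(k_fold_lines(samples, k=5)):
    print(fold_id, len(train), len(valid))
```

Each pair could then be written to two temporary files and used as the training and validation inputs for one run, with the K validation scores averaged afterwards.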

Failed: pip install xlearn

pip install failed

thread problem

Compiling is OK, but running './run_example.sh' in the build dir throws an error:
'''
terminate called after throwing an instance of 'std::system_error'
what(): Enable multithreading to use std::thread: Operation not permitted
'''
I modified line 33 of CMakeLists.txt in the root dir to add:
'''
-Wl,--no-as-needed -pthread
'''
following https://stackoverflow.com/questions/17274032/c-threads-stdsystem-error-operation-not-permitted
but it doesn't work.

Any hints?
Thank you!

How to set early-stopping number in python param?

I want to set the early-stopping number to 10, but I found there is no param for setting it.

Feeding numpy array python example

Is it possible to feed numpy array to the model?

I can't find an example here:
https://github.com/aksnzhy/xlearn/blob/master/doc/python_package.md
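
Until arrays can be passed directly, one workaround is to write the array to a text file first. The sketch below assumes the libsvm text format (`label index:value ...`) and skips zero entries. `rows_to_libsvm` is a hypothetical helper, not an xlearn function; it works on any 2-D sequence, including the result of calling `tolist()` on a numpy array.

```python
def rows_to_libsvm(features, labels):
    """Return libsvm-formatted lines for a dense 2-D sequence of features."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    lines = []
    for row, label in zip(features, labels):
        # Keep only non-zero entries; indices are 0-based in this sketch.
        tokens = [f"{idx}:{value}" for idx, value in enumerate(row) if value != 0]
        lines.append(" ".join([str(label)] + tokens))
    return lines

X = [[0.0, 1.5, 0.0], [2.0, 0.0, 3.0]]
y = [1, 0]
for line in rows_to_libsvm(X, y):
    print(line)
```

The resulting lines can be written to a file and passed to `setTrain`. Whether indices should start at 0 or 1 depends on the reader, so that choice is an assumption here.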

training failed when the data size is big enough

[root@emr-header-1:/root/source/xlearn/build 2017-12-06 17:45:39]#./xlearn_train ./small_train.txt -v ./small_test.txt -s 2
----------------------------------------------------------------------------------------------
           _
          | |
     __  _| |     ___  __ _ _ __ _ __
     \ \/ / |    / _ \/ _` | '__| '_ \ 
      >  <| |___|  __/ (_| | |  | | | |
     /_/\_\_____/\___|\__,_|_|  |_| |_|

        xLearn   -- 0.10 Version --
----------------------------------------------------------------------------------------------

[ ACTION     ] Read Problem ...
[------------] First check if the text file has been already converted to binary format.
[------------] Binary file (./small_train.txt.bin) NOT found. Convert text file to binary file.
[------------] First check if the text file has been already converted to binary format.
[------------] Binary file (./small_test.txt.bin) NOT found. Convert text file to binary file.
[------------] Number of Feature: 9991
[------------] Number of Field: 18
[------------] Time cost for reading problem: 0.01 (sec)
[ ACTION     ] Initialize model ...
[------------] Model size: 5.56 MB
[------------] Time cost for model initial: 0.00 (sec)
[ ACTION     ] Start to train ...
[------------] Epoch      Train log_loss       Test log_loss     Time cost (sec)
[   10%      ]     1            0.595659            0.533602                0.03
[   20%      ]     2            0.542846            0.531371                0.03
[   30%      ]     3            0.522697            0.529272                0.03
[   40%      ]     4            0.504781            0.538329                0.03
[   50%      ]     5            0.490954            0.527123                0.03
[   60%      ]     6            0.481927            0.531130                0.03
[   70%      ]     7            0.469584            0.539557                0.03
[ ACTION     ] Early-stopping at epoch 5
[ ACTION     ] Finish training and start to save model ...
[------------] Model file: ./small_train.txt.model
[------------] Time cost for saving model: 0.01 (sec)
[ ACTION     ] Clear the xLearn environment ...
[------------] Total time cost: 0.26 (sec)


[root@emr-header-1:/root/source/xlearn/build 2017-12-06 17:47:11]# ./xlearn_train /root/data/avazu-site.tr -v  /root/data/avazu-site.t -s 2
----------------------------------------------------------------------------------------------
           _
          | |
     __  _| |     ___  __ _ _ __ _ __
     \ \/ / |    / _ \/ _` | '__| '_ \ 
      >  <| |___|  __/ (_| | |  | | | |
     /_/\_\_____/\___|\__,_|_|  |_| |_|

        xLearn   -- 0.10 Version --
----------------------------------------------------------------------------------------------

[ ACTION     ] Read Problem ...
[------------] First check if the text file has been already converted to binary format.
[------------] Binary file (/root/data/avazu-site.tr.bin) NOT found. Convert text file to binary file.
Aborted

The commands used to get the data:

 wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/avazu-app.t.bz2
 wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/avazu-app.tr.bz2
 bzip2 -dk avazu-app.t.bz2 
 bzip2 -dk avazu-app.tr.bz2

Memory leak during disk reading on large dataset

Hi there,

Using master, it appears there's a memory leak when reading in the initial dataset. With the --disk flag specified, it appears to happen when checking for the max feature size. It runs until the OOM killer ends it, after consuming around 105 GB of memory on this particular machine.

Our dataset of choice is 68 GB in size, about 160 million samples, with 1 million features total (sparse). Happy to provide more information as needed.

Any thoughts?

install error!!! Exception: Please install CMake first

cmake is installed on my Linux machine (cmake --version reports version 3.5.1)!
Why do I have this problem?

feature importance?

About the FM model:
Is there a Python API like "model.feature_importance()" in lightgbm or xgboost?

Issues when compiling in OSX

System : OSX 10.13.1
compiler: Apple clang 9.0.0.9000038
cmake version: 3.9.6

cmake info:

-- The C compiler identification is AppleClang 9.0.0.9000038
-- The CXX compiler identification is AppleClang 9.0.0.9000038
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found PythonInterp: /Users/Fido/anaconda/envs/python3/bin/python (found version "3.6.2")
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - found
-- Found Threads: TRUE
-- Configuring done
-- Generating done
-- Build files have been written to: /Users/Fido/workspace/xlearn/build

Error compile info :

Scanning dependencies of target gtest
[ 1%] Building CXX object gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o
In file included from /Users/Fido/workspace/xlearn/gtest/src/gtest-all.cc:39:
In file included from /Users/Fido/workspace/xlearn/gtest/include/gtest/gtest.h:55:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:138:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ios:216:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/__locale:15:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/string:470:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/string_view:171:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/__string:56:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/algorithm:640:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/memory:629:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/typeinfo:61:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/exception:82:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/cstdlib:86:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/stdlib.h:94:
In file included from /usr/include/stdlib.h:65:
In file included from /usr/include/sys/wait.h:110:
/usr/include/sys/resource.h:196:2: error: unknown type name 'uint8_t'
uint8_t ri_uuid[16];
^
/usr/include/sys/resource.h:197:2: error: unknown type name 'uint64_t'
uint64_t ri_user_time;
^
/usr/include/sys/resource.h:198:2: error: unknown type name 'uint64_t'
uint64_t ri_system_time;
^
/usr/include/sys/resource.h:199:2: error: unknown type name 'uint64_t'
uint64_t ri_pkg_idle_wkups;
^
/usr/include/sys/resource.h:200:2: error: unknown type name 'uint64_t'
uint64_t ri_interrupt_wkups;
^
/usr/include/sys/resource.h:201:2: error: unknown type name 'uint64_t'
uint64_t ri_pageins;
^
/usr/include/sys/resource.h:202:2: error: unknown type name 'uint64_t'
uint64_t ri_wired_size;
^
/usr/include/sys/resource.h:203:2: error: unknown type name 'uint64_t'
uint64_t ri_resident_size;
^
/usr/include/sys/resource.h:204:2: error: unknown type name 'uint64_t'
uint64_t ri_phys_footprint;
^
/usr/include/sys/resource.h:205:2: error: unknown type name 'uint64_t'
uint64_t ri_proc_start_abstime;
^
/usr/include/sys/resource.h:206:2: error: unknown type name 'uint64_t'
uint64_t ri_proc_exit_abstime;
^
/usr/include/sys/resource.h:210:2: error: unknown type name 'uint8_t'
uint8_t ri_uuid[16];
^
/usr/include/sys/resource.h:211:2: error: unknown type name 'uint64_t'
uint64_t ri_user_time;
^
/usr/include/sys/resource.h:212:2: error: unknown type name 'uint64_t'
uint64_t ri_system_time;
^
/usr/include/sys/resource.h:213:2: error: unknown type name 'uint64_t'
uint64_t ri_pkg_idle_wkups;
^
/usr/include/sys/resource.h:214:2: error: unknown type name 'uint64_t'
uint64_t ri_interrupt_wkups;
^
/usr/include/sys/resource.h:215:2: error: unknown type name 'uint64_t'
uint64_t ri_pageins;
^
/usr/include/sys/resource.h:216:2: error: unknown type name 'uint64_t'
uint64_t ri_wired_size;
^
/usr/include/sys/resource.h:217:2: error: unknown type name 'uint64_t'
uint64_t ri_resident_size;
^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated.
make[2]: *** [gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o] Error 1
make[1]: *** [gtest/CMakeFiles/gtest.dir/all] Error 2
make: *** [all] Error 2

It seems that stdint.h has not been included, so I took a look at /usr/include/sys/resource.h and found the following:

#if __DARWIN_C_LEVEL >= __DARWIN_C_FULL
#include <stdint.h>
#endif /* __DARWIN_C_LEVEL >= __DARWIN_C_FULL */

But I don't know how to solve this problem. Any ideas?

cv score

How can I get the cv score without looking at the console / terminal? I'm using a Jupyter notebook.

Thanks

support of csv format

The documentation says xlearn supports the csv format, but I got this error:

Why is the content of model.out garbled?

Sorry, I think I made a mistake; everything is OK.

coding style

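The existing code looks somewhat like it follows the Google C++ Code Style, but in many places it is not fully consistent. Consider strictly following one good standard, for example the Google C++ Code Style ( http://google.github.io/styleguide/cppguide.html ).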

install Xlearn

Why did I fail to install Xlearn? Error: "Could not find a version that satisfies the requirement xlearn (from versions: )
No matching distribution found for xlearn"

Segmentation fault in python FFM predict

I've trained a model and saved it to disk. When I try to make predictions, I get a segmentation fault like below:

[ ACTION     ] Load model ...
[------------] Load model from artifacts/xlfmrec/model-ffm-best-val.out
[------------] Loss function: cross-entropy
[------------] Score function: ffm
[------------] Number of Feature: 359966
[------------] Number of K: 8
[------------] Number of field: 14
[------------] Time cost for loading model: 0.12 (sec)
[ ACTION     ] Read Problem ...
[------------] First check if the text file has been already converted to binary format.
[------------] Binary file (artifacts/xlfmrec/feats-tst-batch-1.txt.bin) NOT found. Convert text file to binary file.
[------------] Time cost for reading problem: 0.00 (sec)
[ ACTION     ] Start to predict ...
Segmentation fault

The python code for loading and trying to make the prediction:

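import xlearn as xl

ffm_model = xl.create_ffm()
ffm_model.setTest('test.txt')             # test data in libffm format
ffm_model.setSigmoid()                    # convert raw scores to values in (0, 1)
ffm_model.predict(model_path, 'out.txt')  # model_path: path to the saved model file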

Here are the first few lines of my test file. In total it has 2556765 lines.

0       0:1:1   1:1:1   2:2:1   6:4278:1        3:179:1 4:2044:1        5:6:1   11:15897:1      8:7223:1        9:0:1   13:7:1  12:5:1
0       0:1:1   1:1:1   2:2:1   6:328:1 3:11:1  4:10:1  7:150:1 5:4:1   11:15897:1      8:7223:1        9:0:1   13:7:1  12:5:1
0       0:3:1   2:5:1   6:170162:1      3:1379:1        4:7085:1        7:4239:1        5:5:1   11:9030:1       8:3870:1        9:0:1   13:7:1  12:5:1
0       0:4:1   1:7:1   2:6:1   6:98783:1       3:3009:1        4:9289:1        5:4:1   11:24233:1      8:26:1  9:16:1  10:1:1  13:12:1 12:2:1
0       0:4:1   1:7:1   2:6:1   6:133246:1      3:7370:1        4:828:1 5:63:1  11:24233:1      8:26:1  9:16:1  10:1:1  13:12:1 12:2:1
0       0:4:1   1:7:1   2:6:1   6:57621:1       3:242:1 5:4:1   11:24233:1      8:26:1  9:16:1  10:1:1  13:12:1 12:2:1
0       0:4:1   1:7:1   2:6:1   6:10008:1       3:939:1 4:4144:1        7:2608:1        5:4:1   11:24233:1      8:26:1  9:16:1  10:1:1  13:12:1 12:2:1
0       0:1:1   1:1:1   2:1:1   6:6011:1        3:1080:1        4:389:1 7:777:1 5:6:1   11:24233:1      8:26:1  9:16:1  10:1:1  13:12:1 12:2:1
0       0:4:1   1:7:1   2:6:1   6:4152:1        3:335:1 4:1982:1        7:1224:1        5:4:1   11:24233:1      8:26:1  9:16:1  10:1:1  13:12:1 12:2:1
0       0:1:1   1:1:1   2:2:1   6:5740:1        3:143:1 5:4:1   11:4776:1       8:1965:1        9:0:1   13:2:1  12:2:1
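
When debugging a crash like this, it can help to rule out the input file first. The sketch below assumes the `label field:feature:value` layout shown in the lines above and reports the largest field index, the largest feature index, and the longest line, which can then be compared with the `Number of field` and `Number of Feature` values in the log. Whether out-of-range indices are actually behind this segmentation fault is not established here, and `scan_libffm` is a hypothetical helper.

```python
def scan_libffm(lines):
    """Return (max_field, max_feature, max_tokens_per_line) for libffm lines."""
    max_field = max_feature = -1
    max_tokens = 0
    for line_no, line in enumerate(lines, start=1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        tokens = parts[1:]  # parts[0] is the label
        max_tokens = max(max_tokens, len(tokens))
        for token in tokens:
            pieces = token.split(":")
            if len(pieces) != 3:
                raise ValueError(f"line {line_no}: malformed token {token!r}")
            max_field = max(max_field, int(pieces[0]))
            max_feature = max(max_feature, int(pieces[1]))
    return max_field, max_feature, max_tokens

# Shortened sample lines in the same layout as the test file above.
sample = [
    "0 0:1:1 1:1:1 2:2:1 6:4278:1 3:179:1",
    "0 0:3:1 2:5:1 6:170162:1 13:7:1 12:5:1",
]
print(scan_libffm(sample))
```

For a real file, pass the open file object instead of the list, for example `scan_libffm(open('test.txt'))`.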

max feature count in one line

There seems to be a limit on the number of feature ids in a single line (sample) of the training file.
Is the max size 10000?

[ ACTION ] Read Problem ...
[------------] First check if the text file has been already converted to binary format.
[------------] Binary file (train.problem.5.bin) NOT found. Convert text file to binary file.

Ubuntu16.04 install-python.sh

When I install the Python package, I get “ImportError: No module named setuptools” at "from setuptools import setup, find_packages".
The cause is that the script uses “sudo python setup.py install” instead of “python setup.py install”. The library is also missing at that point, so “libxlearn.so” needs to be copied to the current directory. install-python.sh can be modified as follows:

#!/bin/bash
cp ../build/libxlearn.so .
python setup.py install

install error

[ 1%] Building CXX object gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o
c++: cannot specify -o with -c or -S with multiple files
make[2]: *** [gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o] Error 1
make[1]: *** [gtest/CMakeFiles/gtest.dir/all] Error 2
make: *** [all] Error 2
/usr/bin/python2
Traceback (most recent call last):
File "setup.py", line 15, in <module>
LIB_PATH = [os.path.relpath(libfile, CURRENT_DIR) for libfile in libpath['find_lib_path']()]
File "xlearn/libpath.py", line 45, in find_lib_path
'Cannot find xlearn Library in the candidate path'
__builtin__.XLearnLibraryNotFound: Cannot find xlearn Library in the candidate path

ERROR: 'void xLearn::DMatrix::Compress(std::vector<unsigned int>&)':

I have tried gcc versions 4.8, 4.9, 6, and 7, and all of them failed to build xlearn.
I'm not sure why.

lemma@lemma:/xlearn$ mkdir build && cd build && cmake ..
-- The C compiler identification is GNU 6.3.0
-- The CXX compiler identification is GNU 5.4.1
-- Check for working C compiler: /usr/bin/gcc
-- Check for working C compiler: /usr/bin/gcc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/g++
-- Check for working CXX compiler: /usr/bin/g++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found PythonInterp: /home/lemma/anaconda2/bin/python (found version "2.7.14")
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Configuring done
-- Generating done
-- Build files have been written to: /home/lemma/xlearn/build
lemma@lemma:/xlearn/build$ make
Scanning dependencies of target gtest
[ 1%] Building CXX object gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o
[ 2%] Linking CXX static library libgtest.a
[ 2%] Built target gtest
Scanning dependencies of target gtest_main
[ 3%] Building CXX object gtest/CMakeFiles/gtest_main.dir/src/gtest_main.cc.o
[ 4%] Linking CXX static library libgtest_main.a
[ 4%] Built target gtest_main
Scanning dependencies of target base
[ 5%] Building CXX object src/base/CMakeFiles/base.dir/logging.cc.o
[ 6%] Building CXX object src/base/CMakeFiles/base.dir/stringprintf.cc.o
[ 7%] Building CXX object src/base/CMakeFiles/base.dir/split_string.cc.o
[ 8%] Building CXX object src/base/CMakeFiles/base.dir/levenshtein_distance.cc.o
[ 9%] Building CXX object src/base/CMakeFiles/base.dir/timer.cc.o
[ 10%] Linking CXX static library libbase.a
[ 10%] Built target base
Scanning dependencies of target thread_pool_test
[ 11%] Building CXX object src/base/CMakeFiles/thread_pool_test.dir/thread_pool_test.cc.o
[ 12%] Linking CXX executable ../../test/base/thread_pool_test
[ 12%] Built target thread_pool_test
Scanning dependencies of target file_util_test
[ 13%] Building CXX object src/base/CMakeFiles/file_util_test.dir/file_util_test.cc.o
[ 14%] Linking CXX executable ../../test/base/file_util_test
[ 14%] Built target file_util_test
Scanning dependencies of target levenshtein_distance_test
[ 15%] Building CXX object src/base/CMakeFiles/levenshtein_distance_test.dir/levenshtein_distance_test.cc.o
[ 16%] Linking CXX executable ../../test/base/levenshtein_distance_test
[ 16%] Built target levenshtein_distance_test
Scanning dependencies of target data
[ 17%] Building CXX object src/data/CMakeFiles/data.dir/model_parameters.cc.o
In file included from /home/lemma/xlearn/src/data/model_parameters.h:31:0,
from /home/lemma/xlearn/src/data/model_parameters.cc:23:
/home/lemma/xlearn/src/data/data_structure.h: In member function 'void xLearn::DMatrix::Compress(std::vector<unsigned int>&)':
/home/lemma/xlearn/src/data/data_structure.h:293:5: error: 'sort' is not a member of 'std'
std::sort(begin(feature_list), end(feature_list));
^
make[2]: *** [src/data/CMakeFiles/data.dir/model_parameters.cc.o] Error 1
make[1]: *** [src/data/CMakeFiles/data.dir/all] Error 2
make: *** [all] Error 2

AttributeError: 'module' object has no attribute 'create_lr'

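After cloning the repo with git, I found that ffm and fm both work, but lr does not. Why is that? Is it because I did not install this part correctly?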

Export trained models

It would be very useful if trained models can be exported using PMML.

install error

I tried to build xlearn from source:

-- The C compiler identification is GNU 4.8.4
-- The CXX compiler identification is GNU 4.4.6
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Found PythonInterp: /usr/bin/python (found version "2.7.11")
-- Looking for include file pthread.h
-- Looking for include file pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Configuring done
-- Generating done
-- Build files have been written to: /data/webroot/blancyin/xlearn-master/build
Scanning dependencies of target gtest
[ 1%] Building CXX object gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o
cc1plus: error: unrecognized command line option "-mbmi-mavx"
cc1plus: error: unrecognized command line option "-std=c++11"
make[2]: *** [gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o] Error 1
make[1]: *** [gtest/CMakeFiles/gtest.dir/all] Error 2
make: *** [all] Error 2
/usr/bin/python2
Traceback (most recent call last):
File "setup.py", line 15, in <module>
LIB_PATH = [os.path.relpath(libfile, CURRENT_DIR) for libfile in libpath['find_lib_path']()]
File "xlearn/libpath.py", line 45, in find_lib_path
'Cannot find xlearn Library in the candidate path'
__builtin__.XLearnLibraryNotFound: Cannot find xlearn Library in the candidate path

Does xlearn support L1 regularization?

How can I get the latent vectors and parameters of each feature from the file "model.out"?

It is great work.

install issue for python interface

Hi:
I've successfully built the executable and the lib/ folder. Then I went to the python-package/ folder under build/ and ran install-python.sh, which reported that everything was OK.
However, when I tried to run "test_python.py", it gave me the error "module 'xlearn' has no attribute 'create_ffm'". It seems the installed Python package failed to find the correct Python module. Do you have any solution for this issue?

Thanks!
Fenglin
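
An error like this can mean that Python imported something other than the package that was just installed, for example a local directory that happens to be named `xlearn`. That is only one possible explanation. The sketch below is a generic check and not part of xlearn: `describe_module` is a hypothetical helper that reports where a module was loaded from and which expected factory functions it lacks. The names `create_linear` and `create_ffm` are taken from other reports in this list, and `create_fm` is assumed by analogy.

```python
import types

def describe_module(module, expected=("create_linear", "create_fm", "create_ffm")):
    """Return the module's file location and which expected names are missing."""
    location = getattr(module, "__file__", None)
    missing = [name for name in expected if not hasattr(module, name)]
    return location, missing

# Stand-in object so the sketch runs without xlearn installed.
fake = types.SimpleNamespace(__file__="/tmp/xlearn/__init__.py",
                             create_ffm=lambda: None)
location, missing = describe_module(fake)
print(location, missing)
```

With the real package, `import xlearn` followed by `print(describe_module(xlearn))` shows whether the import resolved to the source tree or to the installed copy.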

R-package installation failure

I tried to install the R package as instructed in makeR.sh but did not succeed. I am using Mac and my R version is 3.3.3. The error messages I got are:

Error in dyn.load(file, DLLpath = DLLpath, ...) :
unable to load shared object '/xlearn/xlearn.Rcheck/xlearn/libs/xlearn.so':
dlopen(/xlearn/xlearn.Rcheck/xlearn/libs/xlearn.so, 6): Symbol not found: _Z11XLearnFit_RP7SEXPRECS0
Referenced from: ~/xlearn/xlearn.Rcheck/xlearn/libs/xlearn.so
Expected in: flat namespace
in ~/xlearn/xlearn.Rcheck/xlearn/libs/xlearn.so
Error: loading failed
Execution halted
ERROR: loading failed

I tried to compile it with clang and gcc-4.9, and also on a Linux platform, but all attempts failed. Any suggestions on how to fix this issue? Thanks a lot.

Installation error

Cannot find xlearn Library in the candidate path

xlearn.libpath.XLearnLibraryNotFound: Cannot find xlearn Library in the candidate path

After installing the Python package, running a demo raises an error.

How to install?

How to install? Is there something like sudo pip install xlearn?

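Welcome XLearn to the family of sparse-data algorithm frameworks, growing together with LightCTR

Anyone passing by is also welcome to take a look at LightCTR, a similar CTR prediction framework.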

A light-weight framework that combines mainstream algorithms of Click-Through-Rate prediction Based Machine Learning and Deep Learning.
Github: https://github.com/cnkuangshi/LightCTR

Thanks to the XLearn author for contributing to the open-source community. I hope we get a chance to exchange ideas.

Does xlearn support setting the number of threads?

Something like the "nthread" param in xgboost.
I don't want it to use all of the CPU every time.

Recommendation for encoding many binary features

Is there a recommendation for encoding a categorical variable as a very large number of binary features?

For example, in the Kaggle KKBox Recommendation Challenge there is an artist_name field. Consider these two options for encoding this feature:

1. artist_name is a categorical feature that can take one of many values, e.g. artist_name:99. This seems to be the most obvious encoding.
2. Each of the possible artist_name values is a binary feature, e.g. artist_99:1.

The second option has the advantage that it can handle cases where a single song has multiple artists. For example, artist_name = artist 1 | artist 2 | artist 3 becomes [artist_1:1, artist_2:1, artist_3:1]. However, this would also mean you have potentially > 10K features. I have not gotten to try this encoding yet, and there is a non-trivial effort in feature engineering to test it.

I don't see any inherent limitation in the FM formulation preventing the second option from working, but is there any limitation in the Xlearn software that would prevent it? Would it be much slower, perform badly, etc.? Any other thoughts?
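
The second encoding can be sketched as follows. This is an illustration under assumptions, not a statement about what xlearn handles well: it assumes the libffm token layout `field:feature:value`, splits the raw cell on `|`, and assigns feature indices in order of first appearance. `encode_multi_valued` and the field number 4 are hypothetical.

```python
def encode_multi_valued(raw_value, field_id, index_map, sep="|"):
    """Encode a multi-valued categorical cell as one binary token per value."""
    tokens = []
    for name in raw_value.split(sep):
        name = name.strip()
        if not name:
            continue
        # Assign the next free feature index the first time a value is seen.
        feature_id = index_map.setdefault(name, len(index_map))
        tokens.append(f"{field_id}:{feature_id}:1")
    return tokens

artist_index = {}
first = encode_multi_valued("artist 1 | artist 2 | artist 3", 4, artist_index)
second = encode_multi_valued("artist 2", 4, artist_index)
print(first)
print(second)
```

All artists share one field here, so a song with three artists simply contributes three active features in that field.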

Illegal instruction (core dumped)

Whether I use python2 or python3, it raises this error.
The code is exactly the same as the examples, and I am sure that the libffm-format data is correct (I have tested it with other libffm packages).

My cpu is Intel(R) Xeon(R) CPU X5675 @ 3.07GHz
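
An illegal-instruction crash can occur when a binary was compiled with instruction-set extensions that the CPU does not implement. One of the build logs in this list shows an AVX-related compiler option, so checking the CPU flags is a reasonable first step, although that cause is only an assumption here. The sketch parses text in the format of Linux `/proc/cpuinfo`; `cpu_flags` is a hypothetical helper and the sample text is made up.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()

# Shortened, made-up sample; on Linux, read the real file instead:
#   with open("/proc/cpuinfo") as f:
#       text = f.read()
sample = "model name : Example CPU\nflags : fpu sse sse2 ssse3 sse4_1 sse4_2\n"
flags = cpu_flags(sample)
print("avx" in flags, "sse4_2" in flags)
```

If `avx` is missing from the real flag list, rebuilding from source without the AVX options would be one thing to try.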

When will this project support pairwise training and label weighting?

Add sklearn interface for better usability

The current XLearn model interface might impact usability, since some of us first do feature engineering using numpy/pandas. Extra steps are then needed to convert the numpy matrix into libsvm/csv and call the fit/predict method in xlearn.

I have implemented a prototype of sklearn interface that utilizes the existing XLearn class and borrows some ideas from sklearn.py in xgboost. Please see https://github.com/randxie/xlearn/blob/master/python-package/xlearn/sklearn.py and the example https://github.com/randxie/xlearn/blob/master/demo/sklearn/example_FM_iris.py for more details. And please let me know if you like the idea or not. We can try to work together to improve the usability of xlearn.

problem about early-stop

[------------] Epoch Train log_loss Test log_loss Test AUC Time cost (sec)
[ 1% ] 1 0.615199 0.614211 0.632410 19.29
[ 2% ] 2 0.614572 0.613964 0.632814 19.08
[ 3% ] 3 0.614475 0.613909 0.632656 19.21
[ 4% ] 4 0.614376 0.614003 0.632342 19.09
[ 5% ] 5 0.614315 0.614300 0.633588 19.18
[ ACTION ] Early-stopping at epoch 3
[ ACTION ] Start to save model ...

param {'lr':0.01, 'lambda':0.002,'metric':'auc','epoch':100}

Why does early-stopping pick epoch 3 and not epoch 5?
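
Reading the numbers in the log, epoch 3 has the lowest test log_loss while epoch 5 has the highest test AUC. That would be consistent with early-stopping tracking the loss rather than the requested `auc` metric, but this is an inference from the printed values, not from xLearn's source. The check below only recomputes both best epochs from the log:

```python
# Per-epoch values copied from the log above (epochs 1 to 5).
test_log_loss = [0.614211, 0.613964, 0.613909, 0.614003, 0.614300]
test_auc = [0.632410, 0.632814, 0.632656, 0.632342, 0.633588]

# Lower is better for log_loss; higher is better for AUC.
best_loss_epoch = min(range(len(test_log_loss)), key=test_log_loss.__getitem__) + 1
best_auc_epoch = max(range(len(test_auc)), key=test_auc.__getitem__) + 1

print("best epoch by log_loss:", best_loss_epoch)
print("best epoch by AUC:", best_auc_epoch)
```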

How to set the epoch number?

param = { 'task':'reg',
          'lr' : 1e-1,
          #'lambda' : 0.002,
          'epoch':100,
          'metric' : 'rmse' }
I just want to set the epoch number to 100, but the program outputs an error:
_check_call(_LIB(XLearnSetInt(ctypes.byref(self.handle),
NameError: global name 'XLearnSetInt' is not defined

xLearn Command Line Guide

I don't know what went wrong. When I enter "./xlearn_train ./small_train.txt", it prints "[ WARNING ] Validation file not found, xLearn has already disable early-stopping."

cmake version > 3.0 ?

When I install from pip, I get:
Exception: Please install CMake first
but my CentOS has cmake 2.8.12.2.

When I build from source code:
./build.sh
CMake Error at CMakeLists.txt:24 (cmake_minimum_required):
CMake 3.0 or higher is required. You are running version 2.8.12.2

Is there a simple way to install xlearn,
or a simple way to update cmake to 3.0?

xlearn from Jupyter notebook

Hi. I'm trying to run xlearn's demo python code from Jupyter notebook.
The code runs with no problem from the Python interpreter. However, when I try to run it from the notebook, the notebook's kernel crashes on the linear_model.fit(param, './model.out') line with the following error:
Kernel Restarting
The kernel appears to have died. It will restart automatically.

Before that line, the following code ran without any problem:
import xlearn as xl

# Training task
linear_model = xl.create_linear() # Use linear model
linear_model.setTrain("./agaricus_train.txt") # Training data
linear_model.setValidate("./agaricus_test.txt") # Validation data

# param:
# 0. Binary classification
# 1. learning rate: 0.2
# 2. lambda: 0.002
# 3. evaluation metric: accuracy
# 4. Use sgd optimization method
param = {'task':'binary', 'lr':0.2,
         'lambda':0.002, 'metric':'acc',
         'opt':'sgd'}

Can you please advise, what can be the problem and how to fix it?
I use xlearn version '0.2.0', which I cloned from git.

Thank you,
Zaven.

Can xlearn support saving model in text mode ?

xlearn is indeed high performance and easy to use, but the model is always serialized in binary mode. Can xlearn support saving the model in text mode, so that we can use it anywhere?

install error

sudo pip install xlearn

Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-build-oBUPP0/xlearn/setup.py", line 102, in <module>
url='https://github.com/aksnzhy/xlearn')
File "/usr/lib64/python2.7/distutils/core.py", line 152, in setup
dist.run_commands()
File "/usr/lib64/python2.7/distutils/dist.py", line 953, in run_commands
self.run_command(cmd)
File "/usr/lib64/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/usr/lib/python2.7/site-packages/wheel/bdist_wheel.py", line 215, in run
self.run_command('install')
File "/usr/lib64/python2.7/distutils/cmd.py", line 326, in run_command
self.distribution.run_command(command)
File "/usr/lib64/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/tmp/pip-build-oBUPP0/xlearn/setup.py", line 74, in run
compile_cpp();
File "/tmp/pip-build-oBUPP0/xlearn/setup.py", line 60, in compile_cpp
silent_call(cmake_cmd, raise_error=True, error_msg='Please install CMake first')
File "/tmp/pip-build-oBUPP0/xlearn/setup.py", line 26, in silent_call
raise Exception(error_msg);
Exception: Please install CMake first

But I have installed CMake (cmake version 3.10.0).
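
In the traceback above, the message is raised by silent_call with a fixed error_msg, so it may not distinguish between a cmake that is missing and one that was found but failed, or one that is simply not on the PATH that sudo uses. The following is a minimal diagnostic sketch, not part of xlearn's setup.py; the helper name diagnose_command is hypothetical. It runs a command and reports which of those cases applies.

```python
import subprocess


def diagnose_command(cmd):
    """Run cmd and report whether it was missing, failed, or succeeded.

    Hypothetical helper for illustration; it is not xlearn code.
    Returns a tuple (status, returncode, output).
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr so the real error is visible
            universal_newlines=True,
        )
    except FileNotFoundError:
        # The executable could not be located on the current PATH.
        return ("not found", None, "")
    status = "ok" if proc.returncode == 0 else "failed"
    return (status, proc.returncode, proc.stdout)


if __name__ == "__main__":
    # Run this both as your user and under sudo to compare the results.
    print(diagnose_command(["cmake", "--version"]))
```

If the result is "not found" only under sudo, the PATH seen by sudo is a likely suspect; if it is "failed", the captured output should show the underlying CMake error.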

Negative AUC with large number of test samples and FFM

Hi, I'm trying to fit the FFM model with 5.9 million training samples and 1.5 million test samples.

If I assign more than about 100K samples for testing I see negative AUC values.

For example, with 100K testing samples it's normal:

[------------] Number of Feature: 359966
[------------] Number of Field: 14
[------------] Time cost for reading problem: 7.71 (sec)
[ ACTION     ] Initialize model ...
[------------] Model size: 464.13 MB
[------------] Time cost for model initial: 0.69 (sec)
[ ACTION     ] Start to train ...
[------------] Epoch      Train log_loss       Test log_loss            Test AUC     Time cost (sec)
[   20%      ]     1            0.634412            0.676192            0.604832               24.80
[   40%      ]     2            0.621513            0.672281            0.612237               24.77
[   60%      ]     3            0.607376            0.664812            0.622405               24.58
[   80%      ]     4            0.591081            0.660661            0.630569               23.11
[  100%      ]     5            0.579556            0.657870            0.635688               24.40
[ ACTION     ] Finish training and start to save model ...
[------------] Model file: model.out
[------------] Time cost for saving model: 1.56 (sec)
[ ACTION     ] Clear the xLearn environment ...
[------------] Total time cost: 132.76 (sec)

But with 200K samples, I see negative AUC values:

[------------] Number of Feature: 359966
[------------] Number of Field: 14
[------------] Time cost for reading problem: 10.12 (sec)
[ ACTION     ] Initialize model ...
[------------] Model size: 464.13 MB
[------------] Time cost for model initial: 0.71 (sec)
[ ACTION     ] Start to train ...
[------------] Epoch      Train log_loss       Test log_loss            Test AUC     Time cost (sec)
[   20%      ]     1            0.634381            0.670924           -3.245600               23.65
[   40%      ]     2            0.621243            0.665583           -3.163935               27.52
^C^C^C^C^C^C[   60%      ]     3            0.606576            0.658818           -3.054138               25.22
[   80%      ]     4            0.590273            0.655266           -2.969608               28.78
[  100%      ]     5            0.578989            0.653103           -2.918939               28.29
[ ACTION     ] Finish training and start to save model ...
[------------] Model file: model.out
[------------] Time cost for saving model: 2.12 (sec)
[ ACTION     ] Clear the xLearn environment ...
[------------] Total time cost: 147.99 (sec)

And if I add all 1.5 million test samples, the values are much more negative.

[------------] Number of Feature: 359966
[------------] Number of Field: 14
[------------] Time cost for reading problem: 21.44 (sec)
[ ACTION     ] Initialize model ...
[------------] Model size: 464.13 MB
[------------] Time cost for model initial: 0.75 (sec)
[ ACTION     ] Start to train ...
[------------] Epoch      Train log_loss       Test log_loss            Test AUC     Time cost (sec)
[    1%      ]     1            0.634459            0.669086         -489.920593               31.13
[    2%      ]     2            0.621340            0.663582         -477.539825               25.53
[    3%      ]     3            0.606978            0.656649         -462.629272               23.70
[    4%      ]     4            0.590852            0.652189         -451.603271               22.88
[    5%      ]     5            0.579603            0.650198         -444.887390               21.49

Interestingly, the test log_loss continues to decrease as expected. Maybe it's an integer overflow in the metric calculation?
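
To illustrate the overflow hypothesis, here is a minimal sketch of how a rank-based AUC could turn negative if the product (number of positives) x (number of negatives) were held in a signed 32-bit integer. This is not xLearn's actual metric code, which is not shown in this report. The function names are hypothetical, and the class balance used in the usage example (50,000 positives, 150,000 negatives) is an assumption for illustration only.

```python
def to_int32(x):
    """Wrap a Python int to a signed 32-bit value, imitating C int overflow."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def auc_rank_sum(labels, scores, wrap32=False):
    """AUC via the Mann-Whitney U statistic (assumes no tied scores).

    With wrap32=True, the pair count n_pos * n_neg is wrapped to 32 bits
    to imitate a hypothetical overflow in the denominator.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum_pos = 0
    n_pos = 0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            rank_sum_pos += rank
            n_pos += 1
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    u = rank_sum_pos - n_pos * (n_pos + 1) // 2  # correctly ordered pairs
    pairs = n_pos * n_neg
    if wrap32:
        pairs = to_int32(pairs)
    return u / pairs


if __name__ == "__main__":
    # Assumed class balance, not taken from the report.
    print(to_int32(25_000 * 75_000))    # still fits in 32 bits
    print(to_int32(50_000 * 150_000))   # exceeds 2**31 - 1 and wraps negative
```

Under this assumption the pair count stays positive for a smaller test set but wraps to a negative number for a larger one, which would flip the sign of the reported AUC while leaving log_loss unaffected.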

Correct way to represent missing data

Hi, can you tell me what's the correct way to represent missing values in the libsvm or libffm formats? For example, if you had a feature gender with values male, female or missing, how would you represent the missing value?
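
For reference, here is a sketch of two conventions commonly used with sparse formats, offered as an assumption and not as an official xLearn recommendation: either omit the feature for that row (absent entries are implicitly zero in libsvm-style formats), or encode "missing" as its own category. The helper to_libffm_line and its parameters are hypothetical; it emits one libffm line in the form label field:feature:value.

```python
def to_libffm_line(label, record, fields, feature_index, missing_token=None):
    """Encode one categorical record as a libffm line.

    Hypothetical helper for illustration.
    - missing_token=None: a missing value is simply omitted from the line.
    - missing_token="missing" (or any string): a missing value becomes its
      own category with its own feature index.
    feature_index maps (field name, value) to an integer and grows as needed.
    """
    parts = [str(label)]
    for field_id, name in enumerate(fields):
        value = record.get(name)
        if value is None:
            if missing_token is None:
                continue  # convention 1: leave the feature out entirely
            value = missing_token  # convention 2: explicit "missing" category
        idx = feature_index.setdefault((name, value), len(feature_index))
        parts.append("{}:{}:1".format(field_id, idx))
    return " ".join(parts)


if __name__ == "__main__":
    index = {}
    fields = ["gender"]
    print(to_libffm_line(1, {"gender": "male"}, fields, index))
    print(to_libffm_line(0, {"gender": None}, fields, index))
    print(to_libffm_line(0, {"gender": None}, fields, index, missing_token="missing"))
```

The second convention lets the model learn a separate weight for "missing", which can matter when missingness itself is informative.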
