aksnzhy / xlearn

High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.

Home Page: https://xlearn-doc.readthedocs.io/en/latest/index.html

License: Apache License 2.0

CMake 0.81% C++ 82.71% Makefile 2.16% Shell 7.74% M4 0.43% C 0.55% Python 5.44% R 0.07% Batchfile 0.07% Dockerfile 0.02%
machine-learning statistics data-science data-analysis factorization-machines ffm fm

xlearn's Introduction

What is xLearn?

xLearn is a high-performance, easy-to-use, and scalable machine learning package that contains linear models (LR), factorization machines (FM), and field-aware factorization machines (FFM), all of which can be used to solve large-scale machine learning problems. xLearn is especially useful for solving machine learning problems on large-scale sparse data. Many real-world datasets involve high-dimensional sparse feature vectors, such as a recommendation system where the number of categories and users is on the order of millions. If you are a user of liblinear, libfm, or libffm, xLearn may be a better choice.

Get Started! (English)

Get Started! (Chinese)

Performance

xLearn is written in high-performance C++ with careful design and optimizations. The system is designed to maximize CPU and memory utilization, provide cache-aware computation, and support lock-free learning. By combining these techniques, xLearn is 5x-13x faster than similar systems.

Ease-of-use

xLearn does not rely on any third-party library; users can just clone the code and compile it with cmake. xLearn also provides simple Python and CLI interfaces for data scientists, and offers many features that are widely used in machine learning and data mining competitions, such as cross-validation and early stopping.

Scalability

xLearn can be used for solving large-scale machine learning problems. xLearn supports out-of-core training, which can handle very large data (TB) by just leveraging the disk of a PC.

How to Contribute

xLearn has been developed and used by many active community members. Your help is very valuable to make it better for everyone.

  • Please contribute if you find any bug in xLearn.
  • Contribute new features you want to see in xLearn.
  • Contribute to the tests to make it more reliable.
  • Contribute to the documents to make it clearer for everyone.
  • Contribute to the examples to share your experience with other users.
  • Open an issue if you run into problems during development.

Note: please post issues and contributions in English so that everyone can benefit from them.

What's New

  • 2019-10-13 Andrew Kane added Ruby bindings for xLearn!

  • 2019-4-25 xLearn 0.4.4 version release. Main update:

    • Support Python DMatrix
    • Better Windows support
    • Fix bugs in previous version
  • 2019-3-25 xLearn 0.4.3 version release. Main update:

    • Fix bugs in previous version
  • 2019-3-12 xLearn 0.4.2 version release. Main update:

    • Release Windows version of xLearn
  • 2019-1-30 xLearn 0.4.1 version release. Main update:

    • More flexible data reader
  • 2018-11-22 xLearn 0.4.0 version release. Main update:

    • Fix bugs in previous version
    • Add online learning for xLearn
  • 2018-11-10 xLearn 0.3.8 version release. Main update:

    • Fix bugs in previous version.
    • Update early-stop mechanism.
  • 2018-11-08 xLearn reaches 2,000 stars! Congrats!

  • 2018-10-29 xLearn 0.3.7 version release. Main update:

    • Add incremental Reader, which can save 50% memory cost.
  • 2018-10-22 xLearn 0.3.5 version release. Main update:

    • Fix bugs in 0.3.4.
  • 2018-10-21 xLearn 0.3.4 version release. Main update:

    • Fix bugs in on-disk training.
    • Support new file format.
  • 2018-10-14 xLearn 0.3.3 version release. Main update:

    • Fix segmentation fault in prediction task.
    • Update early-stop mechanism.
  • 2018-09-21 xLearn 0.3.2 version release. Main update:

    • Fix bugs in previous version
    • New TXT format for model output
  • 2018-09-08 xLearn uses the new logo.

  • 2018-09-07 The Chinese document is available now!

  • 2018-03-08 xLearn 0.3.0 version release. Main update:

    • Fix bugs in previous version
    • Solved the memory leak problem for on-disk learning
    • Support TXT model checkpoint
    • Support Scikit-Learn API
  • 2017-12-18 xLearn 0.2.0 version release. Main update:

    • Fix bugs in previous version
    • Support pip installation
    • New Documents
    • Faster FTRL algorithm
  • 2017-11-24 The first version (0.1.0) of xLearn is released!

xlearn's Issues

training failed when the data size is big enough

[root@emr-header-1:/root/source/xlearn/build 2017-12-06 17:45:39]#./xlearn_train ./small_train.txt -v ./small_test.txt -s 2
----------------------------------------------------------------------------------------------
           _
          | |
     __  _| |     ___  __ _ _ __ _ __
     \ \/ / |    / _ \/ _` | '__| '_ \ 
      >  <| |___|  __/ (_| | |  | | | |
     /_/\_\_____/\___|\__,_|_|  |_| |_|

        xLearn   -- 0.10 Version --
----------------------------------------------------------------------------------------------

[ ACTION     ] Read Problem ...
[------------] First check if the text file has been already converted to binary format.
[------------] Binary file (./small_train.txt.bin) NOT found. Convert text file to binary file.
[------------] First check if the text file has been already converted to binary format.
[------------] Binary file (./small_test.txt.bin) NOT found. Convert text file to binary file.
[------------] Number of Feature: 9991
[------------] Number of Field: 18
[------------] Time cost for reading problem: 0.01 (sec)
[ ACTION     ] Initialize model ...
[------------] Model size: 5.56 MB
[------------] Time cost for model initial: 0.00 (sec)
[ ACTION     ] Start to train ...
[------------] Epoch      Train log_loss       Test log_loss     Time cost (sec)
[   10%      ]     1            0.595659            0.533602                0.03
[   20%      ]     2            0.542846            0.531371                0.03
[   30%      ]     3            0.522697            0.529272                0.03
[   40%      ]     4            0.504781            0.538329                0.03
[   50%      ]     5            0.490954            0.527123                0.03
[   60%      ]     6            0.481927            0.531130                0.03
[   70%      ]     7            0.469584            0.539557                0.03
[ ACTION     ] Early-stopping at epoch 5
[ ACTION     ] Finish training and start to save model ...
[------------] Model file: ./small_train.txt.model
[------------] Time cost for saving model: 0.01 (sec)
[ ACTION     ] Clear the xLearn environment ...
[------------] Total time cost: 0.26 (sec)


[root@emr-header-1:/root/source/xlearn/build 2017-12-06 17:47:11]# ./xlearn_train /root/data/avazu-site.tr -v  /root/data/avazu-site.t -s 2
----------------------------------------------------------------------------------------------
           _
          | |
     __  _| |     ___  __ _ _ __ _ __
     \ \/ / |    / _ \/ _` | '__| '_ \ 
      >  <| |___|  __/ (_| | |  | | | |
     /_/\_\_____/\___|\__,_|_|  |_| |_|

        xLearn   -- 0.10 Version --
----------------------------------------------------------------------------------------------

[ ACTION     ] Read Problem ...
[------------] First check if the text file has been already converted to binary format.
[------------] Binary file (/root/data/avazu-site.tr.bin) NOT found. Convert text file to binary file.
Aborted

Commands used to get the data:

 wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/avazu-app.t.bz2
 wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/avazu-app.tr.bz2
 bzip2 -dk avazu-app.t.bz2 
 bzip2 -dk avazu-app.tr.bz2 

xLearn Command Line Guide

I don't know what went wrong. When I enter "./xlearn_train ./small_train.txt", it prints "[ WARNING ] Validation file not found, xLearn has already disable early-stopping."

Negative AUC with large number of test samples and FFM

Hi, I'm trying to fit the FFM model with 5.9 million training samples and 1.5 million test samples.

If I assign more than about 100K samples for testing I see negative AUC values.

For example, with 100K testing samples it's normal:

[------------] Number of Feature: 359966
[------------] Number of Field: 14
[------------] Time cost for reading problem: 7.71 (sec)
[ ACTION     ] Initialize model ...
[------------] Model size: 464.13 MB
[------------] Time cost for model initial: 0.69 (sec)
[ ACTION     ] Start to train ...
[------------] Epoch      Train log_loss       Test log_loss            Test AUC     Time cost (sec)
[   20%      ]     1            0.634412            0.676192            0.604832               24.80
[   40%      ]     2            0.621513            0.672281            0.612237               24.77
[   60%      ]     3            0.607376            0.664812            0.622405               24.58
[   80%      ]     4            0.591081            0.660661            0.630569               23.11
[  100%      ]     5            0.579556            0.657870            0.635688               24.40
[ ACTION     ] Finish training and start to save model ...
[------------] Model file: model.out
[------------] Time cost for saving model: 1.56 (sec)
[ ACTION     ] Clear the xLearn environment ...
[------------] Total time cost: 132.76 (sec)

But with 200K samples, I see negative AUC values:

[------------] Number of Feature: 359966
[------------] Number of Field: 14
[------------] Time cost for reading problem: 10.12 (sec)
[ ACTION     ] Initialize model ...
[------------] Model size: 464.13 MB
[------------] Time cost for model initial: 0.71 (sec)
[ ACTION     ] Start to train ...
[------------] Epoch      Train log_loss       Test log_loss            Test AUC     Time cost (sec)
[   20%      ]     1            0.634381            0.670924           -3.245600               23.65
[   40%      ]     2            0.621243            0.665583           -3.163935               27.52
^C^C^C^C^C^C[   60%      ]     3            0.606576            0.658818           -3.054138               25.22
[   80%      ]     4            0.590273            0.655266           -2.969608               28.78
[  100%      ]     5            0.578989            0.653103           -2.918939               28.29
[ ACTION     ] Finish training and start to save model ...
[------------] Model file: model.out
[------------] Time cost for saving model: 2.12 (sec)
[ ACTION     ] Clear the xLearn environment ...
[------------] Total time cost: 147.99 (sec)

And if I add all 1.5 million test samples, the values are much more negative.

[------------] Number of Feature: 359966
[------------] Number of Field: 14
[------------] Time cost for reading problem: 21.44 (sec)
[ ACTION     ] Initialize model ...
[------------] Model size: 464.13 MB
[------------] Time cost for model initial: 0.75 (sec)
[ ACTION     ] Start to train ...
[------------] Epoch      Train log_loss       Test log_loss            Test AUC     Time cost (sec)
[    1%      ]     1            0.634459            0.669086         -489.920593               31.13
[    2%      ]     2            0.621340            0.663582         -477.539825               25.53
[    3%      ]     3            0.606978            0.656649         -462.629272               23.70
[    4%      ]     4            0.590852            0.652189         -451.603271               22.88
[    5%      ]     5            0.579603            0.650198         -444.887390               21.49

What's interesting is that the Test log_loss continues to decrease. Maybe it's a type overflow error in the metric calculation?
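The overflow hypothesis above can be illustrated with a small sketch. This is not xLearn's actual metric code, which is not shown in this report; `auc_by_pairs` and `wrap_int32` are hypothetical helpers written only for illustration. AUC can be computed as the fraction of (positive, negative) pairs that are ranked correctly, so its denominator is the product of the two class counts. That product can exceed the signed 32-bit range once each class has tens of thousands of samples (for example 50,000 x 50,000). Log loss is an average of per-sample terms and involves no such product, which would be consistent with it staying well behaved.

```python
def auc_by_pairs(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Runs in O(P * N) time, which is
    fine for illustration but not for large test sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


def wrap_int32(x):
    """Simulate what a signed 32-bit integer would hold after overflow."""
    return (x + 2**31) % 2**32 - 2**31


# Python integers never overflow, so the true pair count is exact here.
n_pos, n_neg = 50_000, 50_000  # assumed class counts, for illustration only
true_pairs = n_pos * n_neg
wrapped_pairs = wrap_int32(true_pairs)

print(auc_by_pairs([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
print(true_pairs, wrapped_pairs)
```

If a pair count like this were held in a 32-bit integer, the wrapped value could be negative, and dividing by it would produce an "AUC" outside the range 0 to 1. Whether that is what happens here is only a guess.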

ERROR: 'void xLearn::DMatrix::Compress(std::vector<unsigned int>&)':

I have tried gcc versions 4.8, 4.9, 6, and 7. All of them failed to install xlearn.
Not sure why.

lemma@lemma:/xlearn$ mkdir build && cd build && cmake ..
-- The C compiler identification is GNU 6.3.0
-- The CXX compiler identification is GNU 5.4.1
-- Check for working C compiler: /usr/bin/gcc
-- Check for working C compiler: /usr/bin/gcc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/g++
-- Check for working CXX compiler: /usr/bin/g++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found PythonInterp: /home/lemma/anaconda2/bin/python (found version "2.7.14")
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Configuring done
-- Generating done
-- Build files have been written to: /home/lemma/xlearn/build
lemma@lemma:/xlearn/build$ make
Scanning dependencies of target gtest
[ 1%] Building CXX object gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o
[ 2%] Linking CXX static library libgtest.a
[ 2%] Built target gtest
Scanning dependencies of target gtest_main
[ 3%] Building CXX object gtest/CMakeFiles/gtest_main.dir/src/gtest_main.cc.o
[ 4%] Linking CXX static library libgtest_main.a
[ 4%] Built target gtest_main
Scanning dependencies of target base
[ 5%] Building CXX object src/base/CMakeFiles/base.dir/logging.cc.o
[ 6%] Building CXX object src/base/CMakeFiles/base.dir/stringprintf.cc.o
[ 7%] Building CXX object src/base/CMakeFiles/base.dir/split_string.cc.o
[ 8%] Building CXX object src/base/CMakeFiles/base.dir/levenshtein_distance.cc.o
[ 9%] Building CXX object src/base/CMakeFiles/base.dir/timer.cc.o
[ 10%] Linking CXX static library libbase.a
[ 10%] Built target base
Scanning dependencies of target thread_pool_test
[ 11%] Building CXX object src/base/CMakeFiles/thread_pool_test.dir/thread_pool_test.cc.o
[ 12%] Linking CXX executable ../../test/base/thread_pool_test
[ 12%] Built target thread_pool_test
Scanning dependencies of target file_util_test
[ 13%] Building CXX object src/base/CMakeFiles/file_util_test.dir/file_util_test.cc.o
[ 14%] Linking CXX executable ../../test/base/file_util_test
[ 14%] Built target file_util_test
Scanning dependencies of target levenshtein_distance_test
[ 15%] Building CXX object src/base/CMakeFiles/levenshtein_distance_test.dir/levenshtein_distance_test.cc.o
[ 16%] Linking CXX executable ../../test/base/levenshtein_distance_test
[ 16%] Built target levenshtein_distance_test
Scanning dependencies of target data
[ 17%] Building CXX object src/data/CMakeFiles/data.dir/model_parameters.cc.o
In file included from /home/lemma/xlearn/src/data/model_parameters.h:31:0,
from /home/lemma/xlearn/src/data/model_parameters.cc:23:
/home/lemma/xlearn/src/data/data_structure.h: In member function 'void xLearn::DMatrix::Compress(std::vector<unsigned int>&)':
/home/lemma/xlearn/src/data/data_structure.h:293:5: error: 'sort' is not a member of 'std'
std::sort(begin(feature_list), end(feature_list));
^
make[2]: *** [src/data/CMakeFiles/data.dir/model_parameters.cc.o] Error 1
make[1]: *** [src/data/CMakeFiles/data.dir/all] Error 2
make: *** [all] Error 2

install Xlearn

Why did I fail to install xlearn? Error: "Could not find a version that satisfies the requirement xlearn (from versions: )
No matching distribution found for xlearn"

The small test file output is not between 0 and 1?

I ran the commands from the install guide:
./xlearn_train ./small_train.txt -v ./small_test.txt -s 2
./xlearn_predict ./small_test.txt ./small_train.txt.model

but when I check the output file small_test.txt.out, the outputs are not in the range 0 to 1.
1 -1.64431
2 -0.561951
3 -0.789286
4 -0.540316
5 -1.27843
6 -0.996344
7 -1.17249
8 -1.60885
9 -0.73878
10 -1.19467
Is there any other parameter I can add to solve the problem? I also ran the code in Python and got the same output. I also ran libffm on the same data and got outputs between 0 and 1.
1 0.197887
2 0.363382
3 0.364715
4 0.3575
5 0.276505
6 0.332071
7 0.471544
8 0.177905
9 0.323288
10 0.232038
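The values in the first list look like raw model scores rather than probabilities. Below is a minimal sketch, assuming that is the case, of mapping them into the range 0 to 1 with the logistic sigmoid. The `sigmoid` helper is illustrative and not part of xLearn. The Python snippet in the segmentation-fault report elsewhere on this page calls `setSigmoid()` before `predict`, which suggests the library can apply this conversion itself.

```python
import math


def sigmoid(score):
    """Map a raw score to (0, 1); written to avoid overflow for large |score|."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)


# Raw outputs copied from small_test.txt.out above.
raw_scores = [-1.64431, -0.561951, -0.789286, -0.540316, -1.27843,
              -0.996344, -1.17249, -1.60885, -0.73878, -1.19467]

probabilities = [sigmoid(s) for s in raw_scores]
for i, p in enumerate(probabilities, start=1):
    print(i, round(p, 6))
```

The converted values are not expected to match the libffm numbers exactly, since the two tools trained separate models.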

feature importance?

About the FM model:
is there a Python API like "model.feature_importance()" in LightGBM or XGBoost?

Ubuntu16.04 install-python.sh

When I install the Python package, I get “ImportError: No module named setuptools” at the line "from setuptools import setup, find_packages".
The problem is that the script runs “sudo python setup.py install” instead of “python setup.py install”. The library is also missing at this point, so you need to copy “libxlearn.so” to the current directory. install-python.sh can be modified as follows:

#!/bin/bash
cp ../build/libxlearn.so .
python setup.py install

install error

[ 1%] Building CXX object gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o
c++: cannot specify -o with -c or -S with multiple files
make[2]: *** [gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o] Error 1
make[1]: *** [gtest/CMakeFiles/gtest.dir/all] Error 2
make: *** [all] Error 2
/usr/bin/python2
Traceback (most recent call last):
  File "setup.py", line 15, in <module>
    LIB_PATH = [os.path.relpath(libfile, CURRENT_DIR) for libfile in libpath['find_lib_path']()]
  File "xlearn/libpath.py", line 45, in find_lib_path
    'Cannot find xlearn Library in the candidate path'
__builtin__.XLearnLibraryNotFound: Cannot find xlearn Library in the candidate path

R-package installation failure

I tried to install the R package as instructed in makeR.sh but did not succeed. I am using Mac and my R version is 3.3.3. The error messages I got are:

Error in dyn.load(file, DLLpath = DLLpath, ...) :
unable to load shared object '/xlearn/xlearn.Rcheck/xlearn/libs/xlearn.so':
dlopen(/xlearn/xlearn.Rcheck/xlearn/libs/xlearn.so, 6): Symbol not found: _Z11XLearnFit_RP7SEXPRECS0
Referenced from: ~/xlearn/xlearn.Rcheck/xlearn/libs/xlearn.so
Expected in: flat namespace
in ~/xlearn/xlearn.Rcheck/xlearn/libs/xlearn.so
Error: loading failed
Execution halted
ERROR: loading failed

I tried to compile it with clang/gcc-4.9, and also on a Linux platform, but all attempts failed. Any suggestions on how to fix this issue? Thanks a lot.

Correct way to represent missing data

Hi, can you tell me what's the correct way to represent missing values in the libsvm or libffm formats? For example, if you had a feature gender with values male, female or missing, how would you represent the missing value?
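For context, two conventions are common with sparse one-hot formats such as libffm: leave the feature out of the row entirely, or give "missing" its own category. Neither is confirmed here as xLearn's recommended approach. The sketch below shows both for the gender example; the field index, feature indices, and helper names are hypothetical.

```python
# Hypothetical index assignment for a single categorical field "gender".
GENDER_FIELD = 0
GENDER_FEATURES = {"male": 0, "female": 1, "missing": 2}


def encode_gender(value, explicit_missing=True):
    """Return the libffm token(s) 'field:feature:value' for one gender value.

    explicit_missing=True  -> missing gets its own one-hot feature.
    explicit_missing=False -> missing is simply left out of the sparse row.
    """
    if value is None:
        if not explicit_missing:
            return []
        value = "missing"
    return [f"{GENDER_FIELD}:{GENDER_FEATURES[value]}:1"]


def encode_row(label, gender, explicit_missing=True):
    """Build one libffm line: '<label> <field:feature:value> ...'."""
    tokens = [str(label)] + encode_gender(gender, explicit_missing)
    return " ".join(tokens)


print(encode_row(1, "female"))
print(encode_row(0, None))
print(encode_row(0, None, explicit_missing=False))
```

The explicit category lets the model learn a weight for "missing" itself, while omission treats a missing value as contributing nothing for that field.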

CV in xlearn

Hi:

Do you know what the default CV is in xlearn? Or is there any way of realizing K-fold CV in xlearn? Thanks!

Best,
Fenglin
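Independently of whatever xlearn does by default, K-fold splitting can be done outside the library. The sketch below assumes the dataset is a text file with one sample per line; `kfold_indices` is a hypothetical helper, not an xLearn function.

```python
import random


def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs for K-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid


lines = [f"{i % 2} 0:{i}:1" for i in range(10)]  # toy libffm-style samples
for fold_id, (train_idx, valid_idx) in enumerate(kfold_indices(len(lines), k=5)):
    train_lines = [lines[i] for i in train_idx]
    valid_lines = [lines[i] for i in valid_idx]
    print(fold_id, len(train_lines), len(valid_lines))
```

Each pair of line lists could be written to a training file and a validation file and passed to one training run, with the validation metric averaged over the K runs.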

How to install?

How to install? Is there something like sudo pip install xlearn?

Benchmarks

How can the benchmarks be reproduced? Which dataset was used? Which parameters and rank?

Memory leak during disk reading on large dataset

Hi there,

Using master, there appears to be a memory leak when reading in the initial dataset. With the --disk flag specified, it seems to happen when checking for the max feature size. It runs until the OOM killer ends it, after consuming around 105 GB of memory on this particular machine.

Our dataset is 68 GB in size, about 160 million samples, with 1 million features in total (sparse). Happy to provide more information as needed.

Any thoughts?

Can xlearn support saving model in text mode ?

xlearn really is high-performance and easy to use, but the model is always serialized in binary mode. Can xlearn support saving the model in text mode, so that we can use it anywhere?

Illegal instruction (core dumped)

Whether I use Python 2 or 3, it raises this error.
The code is exactly the same as the examples, and I am sure that the libffm-format data is correct (I have tested it with other libffm packages).

My CPU is Intel(R) Xeon(R) CPU X5675 @ 3.07GHz

cmake version > 3.0 ?

When I install from pip, I get:
Exception: Please install CMake first
but my CentOS has cmake 2.8.12.2.

When I build from source:
./build.sh
CMake Error at CMakeLists.txt:24 (cmake_minimum_required):
CMake 3.0 or higher is required. You are running version 2.8.12.2

Is there a simpler way to install xlearn,
or a simple way to update cmake to 3.0?
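The CMakeLists.txt error quoted above states that CMake 3.0 or higher is required. As a small illustration of that requirement, the sketch below parses the first line of `cmake --version` output and compares it with a minimum version. The helper names are hypothetical; the version strings are the ones reported on this page.

```python
import re


def parse_cmake_version(version_output):
    """Extract the version tuple from the first line of `cmake --version`."""
    match = re.search(r"cmake version (\d+(?:\.\d+)*)", version_output)
    if match is None:
        raise ValueError("no CMake version found in: %r" % version_output)
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(version_output, minimum=(3, 0)):
    """True if the reported CMake version is at least `minimum`."""
    return parse_cmake_version(version_output) >= minimum


print(meets_minimum("cmake version 2.8.12.2"))
print(meets_minimum("cmake version 3.10.0"))
```

Comparing integer tuples avoids the string-comparison trap where "3.10.0" sorts before "3.9.6".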

make error

[ 1%] Building CXX object gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o
cc1plus: error: unrecognized command line option "-std=c++11"
make[2]: *** [gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o] Error 1
make[1]: *** [gtest/CMakeFiles/gtest.dir/all] Error 2
make: *** [all] Error 2

cv score

How can I get the CV score without looking at the console/terminal? I'm using Jupyter Notebook.

Thanks

install error

sudo pip install xlearn

Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-build-oBUPP0/xlearn/setup.py", line 102, in <module>
url='https://github.com/aksnzhy/xlearn')
File "/usr/lib64/python2.7/distutils/core.py", line 152, in setup
dist.run_commands()
File "/usr/lib64/python2.7/distutils/dist.py", line 953, in run_commands
self.run_command(cmd)
File "/usr/lib64/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/usr/lib/python2.7/site-packages/wheel/bdist_wheel.py", line 215, in run
self.run_command('install')
File "/usr/lib64/python2.7/distutils/cmd.py", line 326, in run_command
self.distribution.run_command(command)
File "/usr/lib64/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/tmp/pip-build-oBUPP0/xlearn/setup.py", line 74, in run
compile_cpp();
File "/tmp/pip-build-oBUPP0/xlearn/setup.py", line 60, in compile_cpp
silent_call(cmake_cmd, raise_error=True, error_msg='Please install CMake first')
File "/tmp/pip-build-oBUPP0/xlearn/setup.py", line 26, in silent_call
raise Exception(error_msg);
Exception: Please install CMake first

but I have installed cmake (cmake version 3.10.0).

Add sklearn interface for better usability

The current xLearn model interface may hurt usability, since many of us first do feature engineering with numpy/pandas. Extra steps are needed to convert the numpy matrix into libsvm/csv and call the fit/predict method in xlearn.

I have implemented a prototype of sklearn interface that utilizes the existing XLearn class and borrows some ideas from sklearn.py in xgboost. Please see https://github.com/randxie/xlearn/blob/master/python-package/xlearn/sklearn.py and the example https://github.com/randxie/xlearn/blob/master/demo/sklearn/example_FM_iris.py for more details. And please let me know if you like the idea or not. We can try to work together to improve the usability of xlearn.
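To make the "extra steps" concrete, here is a minimal sketch of the conversion from dense rows to libsvm text lines. It is not the code from the linked prototype; `rows_to_libsvm` is a hypothetical helper, and the zero-based feature indices are an assumption.

```python
def rows_to_libsvm(features, labels):
    """Convert dense rows to libsvm lines, skipping zero-valued entries."""
    lines = []
    for row, label in zip(features, labels):
        tokens = [str(label)]
        for index, value in enumerate(row):
            if value != 0:
                tokens.append(f"{index}:{value}")
        lines.append(" ".join(tokens))
    return lines


# Plain lists stand in for a numpy matrix; iterating rows works the same way.
X = [[5.1, 0.0, 1.4], [0.0, 3.0, 0.0]]
y = [0, 1]
for line in rows_to_libsvm(X, y):
    print(line)
```

A scikit-learn style wrapper would hide this step behind fit(X, y), so that users never write the intermediate file themselves.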

infinity/NaN values in prediction

I built 5 models using subsets of the full training dataset and then predicted on the test dataset. As you can see, I got infinity/NaN values. What could be the reason?

I could provide all the data if needed.

Thanks,

Yi
screenshot from 2017-11-26 11-33-23

screenshot from 2017-11-26 11-34-03

install error

I tried to build xlearn from source:

-- The C compiler identification is GNU 4.8.4
-- The CXX compiler identification is GNU 4.4.6
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Found PythonInterp: /usr/bin/python (found version "2.7.11")
-- Looking for include file pthread.h
-- Looking for include file pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Configuring done
-- Generating done
-- Build files have been written to: /data/webroot/blancyin/xlearn-master/build
Scanning dependencies of target gtest
[ 1%] Building CXX object gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o
cc1plus: error: unrecognized command line option "-mbmi-mavx"
cc1plus: error: unrecognized command line option "-std=c++11"
make[2]: *** [gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o] Error 1
make[1]: *** [gtest/CMakeFiles/gtest.dir/all] Error 2
make: *** [all] Error 2
/usr/bin/python2
Traceback (most recent call last):
  File "setup.py", line 15, in <module>
    LIB_PATH = [os.path.relpath(libfile, CURRENT_DIR) for libfile in libpath['find_lib_path']()]
  File "xlearn/libpath.py", line 45, in find_lib_path
    'Cannot find xlearn Library in the candidate path'
__builtin__.XLearnLibraryNotFound: Cannot find xlearn Library in the candidate path

Issues when compiling in OSX

System : OSX 10.13.1
compiler: Apple clang 9.0.0.9000038
cmake version: 3.9.6

cmake info:

-- The C compiler identification is AppleClang 9.0.0.9000038
-- The CXX compiler identification is AppleClang 9.0.0.9000038
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found PythonInterp: /Users/Fido/anaconda/envs/python3/bin/python (found version "3.6.2")
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - found
-- Found Threads: TRUE
-- Configuring done
-- Generating done
-- Build files have been written to: /Users/Fido/workspace/xlearn/build

Compile error output:

Scanning dependencies of target gtest
[ 1%] Building CXX object gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o
In file included from /Users/Fido/workspace/xlearn/gtest/src/gtest-all.cc:39:
In file included from /Users/Fido/workspace/xlearn/gtest/include/gtest/gtest.h:55:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ostream:138:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/ios:216:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/__locale:15:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/string:470:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/string_view:171:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/__string:56:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/algorithm:640:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/memory:629:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/typeinfo:61:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/exception:82:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/cstdlib:86:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/stdlib.h:94:
In file included from /usr/include/stdlib.h:65:
In file included from /usr/include/sys/wait.h:110:
/usr/include/sys/resource.h:196:2: error: unknown type name 'uint8_t'
uint8_t ri_uuid[16];
^
/usr/include/sys/resource.h:197:2: error: unknown type name 'uint64_t'
uint64_t ri_user_time;
^
/usr/include/sys/resource.h:198:2: error: unknown type name 'uint64_t'
uint64_t ri_system_time;
^
/usr/include/sys/resource.h:199:2: error: unknown type name 'uint64_t'
uint64_t ri_pkg_idle_wkups;
^
/usr/include/sys/resource.h:200:2: error: unknown type name 'uint64_t'
uint64_t ri_interrupt_wkups;
^
/usr/include/sys/resource.h:201:2: error: unknown type name 'uint64_t'
uint64_t ri_pageins;
^
/usr/include/sys/resource.h:202:2: error: unknown type name 'uint64_t'
uint64_t ri_wired_size;
^
/usr/include/sys/resource.h:203:2: error: unknown type name 'uint64_t'
uint64_t ri_resident_size;
^
/usr/include/sys/resource.h:204:2: error: unknown type name 'uint64_t'
uint64_t ri_phys_footprint;
^
/usr/include/sys/resource.h:205:2: error: unknown type name 'uint64_t'
uint64_t ri_proc_start_abstime;
^
/usr/include/sys/resource.h:206:2: error: unknown type name 'uint64_t'
uint64_t ri_proc_exit_abstime;
^
/usr/include/sys/resource.h:210:2: error: unknown type name 'uint8_t'
uint8_t ri_uuid[16];
^
/usr/include/sys/resource.h:211:2: error: unknown type name 'uint64_t'
uint64_t ri_user_time;
^
/usr/include/sys/resource.h:212:2: error: unknown type name 'uint64_t'
uint64_t ri_system_time;
^
/usr/include/sys/resource.h:213:2: error: unknown type name 'uint64_t'
uint64_t ri_pkg_idle_wkups;
^
/usr/include/sys/resource.h:214:2: error: unknown type name 'uint64_t'
uint64_t ri_interrupt_wkups;
^
/usr/include/sys/resource.h:215:2: error: unknown type name 'uint64_t'
uint64_t ri_pageins;
^
/usr/include/sys/resource.h:216:2: error: unknown type name 'uint64_t'
uint64_t ri_wired_size;
^
/usr/include/sys/resource.h:217:2: error: unknown type name 'uint64_t'
uint64_t ri_resident_size;
^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated.
make[2]: *** [gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o] Error 1
make[1]: *** [gtest/CMakeFiles/gtest.dir/all] Error 2
make: *** [all] Error 2

It seems that stdint.h has not been included, so I took a look at /usr/include/sys/resource.h and found the following:

#if __DARWIN_C_LEVEL >= __DARWIN_C_FULL
#include <stdint.h>
#endif /* __DARWIN_C_LEVEL >= __DARWIN_C_FULL */

I don't know how to solve this problem, though. Any ideas?

Segmentation fault in python FFM predict

I've trained a model and saved it to disk. When I try to make predictions, I get a segmentation fault like below:

[ ACTION     ] Load model ...
[------------] Load model from artifacts/xlfmrec/model-ffm-best-val.out
[------------] Loss function: cross-entropy
[------------] Score function: ffm
[------------] Number of Feature: 359966
[------------] Number of K: 8
[------------] Number of field: 14
[------------] Time cost for loading model: 0.12 (sec)
[ ACTION     ] Read Problem ...
[------------] First check if the text file has been already converted to binary format.
[------------] Binary file (artifacts/xlfmrec/feats-tst-batch-1.txt.bin) NOT found. Convert text file to binary file.
[------------] Time cost for reading problem: 0.00 (sec)
[ ACTION     ] Start to predict ...
Segmentation fault

The python code for loading and trying to make the prediction:

import xlearn as xl

ffm_model = xl.create_ffm()
ffm_model.setTest('test.txt')
ffm_model.setSigmoid()
ffm_model.predict(model_path, 'out.txt')

Here are the first few lines of my test file. In total it has 2556765 lines.

0       0:1:1   1:1:1   2:2:1   6:4278:1        3:179:1 4:2044:1        5:6:1   11:15897:1      8:7223:1        9:0:1   13:7:1  12:5:1
0       0:1:1   1:1:1   2:2:1   6:328:1 3:11:1  4:10:1  7:150:1 5:4:1   11:15897:1      8:7223:1        9:0:1   13:7:1  12:5:1
0       0:3:1   2:5:1   6:170162:1      3:1379:1        4:7085:1        7:4239:1        5:5:1   11:9030:1       8:3870:1        9:0:1   13:7:1  12:5:1
0       0:4:1   1:7:1   2:6:1   6:98783:1       3:3009:1        4:9289:1        5:4:1   11:24233:1      8:26:1  9:16:1  10:1:1  13:12:1 12:2:1
0       0:4:1   1:7:1   2:6:1   6:133246:1      3:7370:1        4:828:1 5:63:1  11:24233:1      8:26:1  9:16:1  10:1:1  13:12:1 12:2:1
0       0:4:1   1:7:1   2:6:1   6:57621:1       3:242:1 5:4:1   11:24233:1      8:26:1  9:16:1  10:1:1  13:12:1 12:2:1
0       0:4:1   1:7:1   2:6:1   6:10008:1       3:939:1 4:4144:1        7:2608:1        5:4:1   11:24233:1      8:26:1  9:16:1  10:1:1  13:12:1 12:2:1
0       0:1:1   1:1:1   2:1:1   6:6011:1        3:1080:1        4:389:1 7:777:1 5:6:1   11:24233:1      8:26:1  9:16:1  10:1:1  13:12:1 12:2:1
0       0:4:1   1:7:1   2:6:1   6:4152:1        3:335:1 4:1982:1        7:1224:1        5:4:1   11:24233:1      8:26:1  9:16:1  10:1:1  13:12:1 12:2:1
0       0:1:1   1:1:1   2:2:1   6:5740:1        3:143:1 5:4:1   11:4776:1       8:1965:1        9:0:1   13:2:1  12:2:1
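
One check that may be worth running is whether the test file contains a field or feature index beyond what the loaded model reports (14 fields, 359966 features). This is only an assumption about a possible cause, not a confirmed diagnosis. A minimal sketch follows; `max_ids` is a hypothetical helper, not part of xlearn, and it assumes whitespace-separated `label field:feature:value` lines like those above:

```python
def max_ids(lines):
    """Return (max_field, max_feature) over libffm-style lines."""
    max_field = max_feature = -1
    for line in lines:
        # The first column is the label; the rest are field:feature:value tokens.
        for token in line.split()[1:]:
            field, feature, _value = token.split(":")
            max_field = max(max_field, int(field))
            max_feature = max(max_feature, int(feature))
    return max_field, max_feature


# Two shortened rows taken from the sample above.
sample = [
    "0 0:1:1 1:1:1 2:2:1 6:4278:1 3:179:1",
    "0 0:3:1 2:5:1 6:170162:1 3:1379:1",
]
max_field, max_feature = max_ids(sample)

# Values reported in the "Load model" log above.
model_fields, model_features = 14, 359966
in_range = max_field < model_fields and max_feature < model_features
print(max_field, max_feature, in_range)
```

For the real file, passing `open('test.txt')` instead of `sample` scans every line; an index outside the model's range would be the first thing to rule out.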

xlearn from Jupyter notebook

Hi. I'm trying to run xlearn's demo Python code from a Jupyter notebook.
The code runs without problems from the Python interpreter. However, when I run it from the notebook, the kernel crashes on the linear_model.fit(param, './model.out') line with the following error:
Kernel Restarting
The kernel appears to have died. It will restart automatically.

Before that line, the following code ran without any problem:
import xlearn as xl

# Training task
linear_model = xl.create_linear() # Use linear model
linear_model.setTrain("./agaricus_train.txt") # Training data
linear_model.setValidate("./agaricus_test.txt") # Validation data

# param:
# 0. Binary classification
# 1. learning rate: 0.2
# 2. lambda: 0.002
# 3. evaluation metric: accuracy
# 4. Use sgd optimization method
param = {'task':'binary', 'lr':0.2,
'lambda':0.002, 'metric':'acc',
'opt':'sgd'}

Can you please advise what the problem might be and how to fix it?
I use xlearn version '0.2.0', which I cloned from git.

Thank you,
Zaven.

max feature count in one line

There seems to be a limit on the number of feature IDs in one line (one sample) of the training file.
Is the maximum 10000?

[ ACTION ] Read Problem ...
[------------] First check if the text file has been already converted to binary format.
[------------] Binary file (train.problem.5.bin) NOT found. Convert text file to binary file.

problem about early-stop

[------------] Epoch  Train log_loss  Test log_loss  Test AUC  Time cost (sec)
[ 1% ]             1        0.615199       0.614211  0.632410            19.29
[ 2% ]             2        0.614572       0.613964  0.632814            19.08
[ 3% ]             3        0.614475       0.613909  0.632656            19.21
[ 4% ]             4        0.614376       0.614003  0.632342            19.09
[ 5% ]             5        0.614315       0.614300  0.633588            19.18
[ ACTION ] Early-stopping at epoch 3
[ ACTION ] Start to save model ...

param {'lr':0.01, 'lambda':0.002,'metric':'auc','epoch':100}

Why does it early-stop at epoch 3 rather than epoch 5?
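
For what it's worth, the log above is consistent with a patience-style rule that watches Test log_loss rather than AUC: the loss is lowest at epoch 3, and two further epochs pass without improvement. The sketch below only illustrates that reading. The function name, the window of 2 epochs, and the choice of log_loss as the monitored value are assumptions, not confirmed xlearn behaviour:

```python
def best_epoch_with_early_stop(losses, window=2):
    """Return (best_epoch, stopped_at) for a simple patience rule.

    Training stops once `window` consecutive epochs fail to improve
    on the best (lowest) loss seen so far. Epochs are 1-indexed.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= window:
            return best_epoch, epoch
    return best_epoch, len(losses)


# Test log_loss values copied from the log above.
test_log_loss = [0.614211, 0.613964, 0.613909, 0.614003, 0.614300]
best, stopped_at = best_epoch_with_early_stop(test_log_loss, window=2)
print(best, stopped_at)  # 3 5
```

Under these assumptions, training halts after epoch 5 but the model from epoch 3 is the one reported, even though Test AUC peaks at epoch 5.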

install issue for python interface

Hi:
I've successfully built the executable and the lib/ folder. Then I went to the python-package/ folder under build/ and ran install-python.sh, which reported that everything was OK.
However, when I tried to run "test_python.py", it gave me the error "module 'xlearn' has no attribute 'create_ffm'". It seems the installed Python package fails to find the correct Python module. Do you have a solution for this issue?

Thanks!
Fenglin

How to set the epoch number?

param = { 'task':'reg',
'lr' : 1e-1,
#'lambda' : 0.002,
'epoch':100,
'metric' : 'rmse' }
I just want to set the epoch number to 100, but the program outputs an error:
_check_call(_LIB(XLearnSetInt(ctypes.byref(self.handle),
NameError: global name 'XLearnSetInt' is not defined

Recommendation for encoding many binary features

Is there a recommendation for encoding a categorical variable as a very large number of binary features?

For example, in the Kaggle KKBox Recommendation Challenge there is an artist_name field. Consider these two options for encoding this feature:

  1. artist_name is a categorical feature that can take one of many values, e.g. artist_name:99. This seems to be the most obvious encoding.
  2. Each of the possible artist_name values is a binary feature, e.g. artist_99:1.

The second option has the advantage that it can handle cases where a single song has multiple artists. For example, artist_name = artist 1 | artist 2 | artist 3 becomes [artist_1:1, artist_2:1, artist_3:1]. However, this would also mean you have potentially > 10K features. I have not gotten to try this encoding yet, and there is a non-trivial effort in feature engineering to test it.

I don't see any inherent limitation in the FM formulation preventing the second option from working, but is there any limitation in the xLearn software that would prevent it? Would it be much slower, perform worse, etc.? Any other thoughts?
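
To make the second option concrete, here is a minimal sketch of how a multi-artist value could be turned into field:index:value tokens. The helper `encode_artists`, the field id 3, and the index-assignment scheme are all hypothetical choices for illustration, not anything prescribed by xLearn:

```python
def encode_artists(artist_field, artists, index_of):
    """Return 'field:index:value' tokens, one per artist in a '|'-separated string."""
    tokens = []
    for name in artists.split("|"):
        name = name.strip()
        if not name:
            continue
        # Assign the next free integer index the first time an artist is seen.
        idx = index_of.setdefault(name, len(index_of))
        tokens.append(f"{artist_field}:{idx}:1")
    return tokens


index_of = {}  # shared artist-name -> feature-index mapping
tokens = encode_artists(3, "artist 1 | artist 2 | artist 3", index_of)
print(tokens)  # ['3:0:1', '3:1:1', '3:2:1']
```

All artists share one field here, so a song with several artists simply contributes several active features in that field. In a real pipeline the indices would need to be offset so they do not collide with features from other columns.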

Installation error

Cannot find xlearn Library in the candidate path
