lenskit / lkpy

Python recommendation toolkit

Home Page: https://lkpy.lenskit.org

License: MIT License

Languages: Python 99.68%, Just 0.32%
Topics: recsys, python, scikit, lenskit

lkpy's Introduction

Python recommendation tools

LensKit is a set of Python tools for experimenting with and studying recommender systems. It provides support for training, running, and evaluating recommender algorithms in a flexible fashion suitable for research and education.

LensKit for Python (LKPY) is the successor to the Java-based LensKit project.

Important

If you use LensKit for Python in published research, please cite:

Michael D. Ekstrand. 2020. LensKit for Python: Next-Generation Software for Recommender Systems Experiments. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20). DOI:10.1145/3340531.3412778. arXiv:1809.03125 [cs.IR].

Warning

This is the main branch of LensKit, following new development in preparation for the 2024 release. It will be changing frequently and incompatibly. You probably want to use a stable release.

Installing

To install the current release with Anaconda (recommended):

conda install -c conda-forge lenskit

Or you can use pip:

pip install lenskit

To use the latest development version, install directly from GitHub:

pip install -U git+https://github.com/lenskit/lkpy

Then see the Getting Started guide.

Developing

To contribute to LensKit, clone or fork the repository, get to work, and submit a pull request. We welcome contributions from anyone; if you are looking for a place to get started, see the issue tracker.

Our development workflow is documented in the wiki; the wiki also contains other information on developing LensKit. User-facing documentation is at https://lkpy.lenskit.org.

We recommend using an Anaconda environment for developing LensKit. We don't maintain the Conda environment specification directly; instead, we maintain the information in pyproject.toml needed to generate it, so that dependencies and versions are defined in one place.

conda-lock can help you set up the environment; the LensKit build tools automate this.

# install bootstrap environment
conda env create -n lkboot -f https://raw.githubusercontent.com/lenskit/lkbuild/main/boot-env.yml
# create the lock file for Python 3.10
conda run -n lkboot --no-capture lkbuild dev-lock -v 3.10
# create the environment
conda env create -n lkpy -f conda-linux-64.lock

This will create a Conda environment called lkpy with the packages required to develop and test LensKit.

Testing Changes

You should always test your changes by running the LensKit test suite:

python -m pytest

If you want to use your changes in a LensKit experiment, you can locally install your modified LensKit into your experiment's environment. We recommend using separate environments for LensKit development and for each experiment; you will need to install the modified LensKit into your experiment's environment:

conda activate my-exp
conda install -c conda-forge flit
cd /path/to/lkpy
flit install --pth-file --deps none

You may need to first uninstall LensKit from your experiment repo; make sure that LensKit's dependencies are all still installed.

Once you have pushed your code to a GitHub branch, you can use a Git repository as a Pip dependency in an environment.yml for your experiment, to keep using the correct modified version of LensKit until your changes make it into a release.

Resources

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. IIS 17-51278. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

lkpy's People

Contributors

azure-pipelines[bot], carlos10seg, dependabot[bot], keener101, kimuratian, lukas-wegmeth, mdekstrand, reppertj, shwetanshusingh, teej, ziyaowei

lkpy's Issues

Add run-time feature training for ALS

The ALS algorithms are readily amenable to computing new or updated feature vectors for new users on the fly: just solve the least squares problem for the user's feature vector given their ratings and the item feature matrix.

We just need to write the code, and possibly factor out part of the solver to allow single-vector solutions. This is probably most robust with the LU-decomposition solver, although we could run the coordinate descent solver for a few rounds.

Unlike #60, this pertains to what we do at predict or recommend time, not updating the model itself.
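
For concreteness, here is a minimal sketch of that single-vector solve, assuming a dense item-feature matrix Q and an L2 regularization strength; the function name and signature are hypothetical:

import numpy as np

def user_vector(rating_values, rated_items, Q, reg=0.1):
    # Hypothetical helper: solve the regularized least-squares problem
    # for one user's feature vector, given the item-feature matrix Q.
    Qu = Q[rated_items, :]                    # rows for the items this user rated
    A = Qu.T @ Qu + reg * np.eye(Q.shape[1])  # normal equations + ridge term
    b = Qu.T @ rating_values
    return np.linalg.solve(A, b)              # dense LU-based solve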

Implement fit_transform API on Bias

It would make sense for the Bias model to implement transform and fit_transform for bulk-normalizing ratings.

We might want to enable it to support sparse matrices, and for sparse_ratings to support pre-defined user and item indices.
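
A minimal sketch of what transform could do, assuming the 0.x Bias attributes mean_, item_offsets_, and user_offsets_ (a float and two pandas Series), and dense frames only:

import pandas as pd

def transform(bias, ratings: pd.DataFrame) -> pd.DataFrame:
    # Subtract fitted biases from a (user, item, rating) frame (sketch only).
    out = ratings.copy()
    out['rating'] -= bias.mean_                                     # global mean
    out['rating'] -= out['item'].map(bias.item_offsets_).fillna(0)  # item bias
    out['rating'] -= out['user'].map(bias.user_offsets_).fillna(0)  # user bias
    return out

fit_transform would then just be bias.fit(ratings) followed by transform(bias, ratings).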

Support partial_fit API when sensible

We should support the partial_fit API to update recommender models where it makes sense (see the sketch after this list):

  • ALS (#60)
  • Bias models
  • Popularity model
  • User-user
  • Unrated item candidate selector
  • Pass-through on TopN and Fallback
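
The core of a partial_fit for the bias models would be incremental mean updates; a self-contained illustration of that building block (not LensKit API):

import numpy as np

class IncrementalMean:
    # Running mean supporting partial updates - the building block a
    # Bias.partial_fit would need (illustrative only).
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def partial_fit(self, values):
        for x in np.asarray(values, dtype='f8'):
            self.n += 1
            self.mean += (x - self.mean) / self.n  # Welford-style update
        return self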

Codecov migration to marketplace app

Hi, Tom from Codecov here.

We noticed that you are using Codecov with fairly high frequency, and we’re so excited to see that! However, because you are not using our app, you may have experienced issues with uploading reports or viewing coverage information. This is due to rate-limiting issues from GitHub.

In order to prevent any future outages, we ask that you move over to our GitHub app integration.

The process is extremely simple and shouldn’t require more than a few clicks, and you should not expect any downtime. By moving to our app, you will no longer need an admin or separate account to manage the relationship with GitHub as the team bot.

Let me know if you have any questions, or if I can help at all with this process.

Support incremental updates to ALS

Incrementally updating ALS is, in principle, easy:

  1. For a new item, solve for the item feature row.
  2. For a new user, solve for the user feature row.
  3. For new ratings, run a small number of rounds of the solver (maybe just one).

(3) needs a little work - do we need to update all users/items, or just affected ones? We need to store the old ratings matrix, not just the preference matrices, because we need the old data too.

I would propose not updating biases, but letting them get stale.
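
Step 1 mirrors the run-time user solve in #114; a sketch, assuming a dense user-feature matrix P (names hypothetical):

import numpy as np

def fold_in_item(rating_values, rating_users, P, reg=0.1):
    # Solve the regularized least-squares problem for a new item's
    # feature row, given the users who rated it (sketch only).
    Pu = P[rating_users, :]
    A = Pu.T @ Pu + reg * np.eye(P.shape[1])
    return np.linalg.solve(A, Pu.T @ rating_values)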

Set up random seeds in subprocesses

Right now, the random seed infrastructure does not propagate to subprocesses with train_isolated (or to predict/recommend subprocesses, but those are less important).
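
One common pattern, sketched here, is to spawn independent child seeds from a root NumPy SeedSequence and hand one to each subprocess task, so every worker draws from a reproducible, non-overlapping stream:

import numpy as np
from concurrent.futures import ProcessPoolExecutor

def _task(seed_seq):
    rng = np.random.default_rng(seed_seq)  # independent stream per task
    return rng.integers(0, 100)

if __name__ == '__main__':
    root = np.random.SeedSequence(42)
    seeds = root.spawn(8)  # one child seed per task, statistically independent
    with ProcessPoolExecutor() as pool:
        print(list(pool.map(_task, seeds)))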

nprocs seems to cause issues.

nprocs = 12 and nprocs = 24 both caused fatal crashes very quickly into the MultiEval process. We should investigate more closely.

Support new user ratings across LensKit

We need to support new user ratings across LensKit, at least for every algorithm where it makes sense (a usage sketch follows the list):

  • Bias
  • Item-Item
  • UnratedItemsCandidateSelector
  • Popular (indirectly supported through the candidate selector)
  • ALS Explicit (#114)
  • ALS Implicit (#114)
  • SKL SVD
  • FunkSVD - we probably don't actually want to do this, just document that it isn't supported and raise a warning
  • implicit BPR - maybe
  • implicit ALS - maybe
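
For predictors, the existing hook for this is the optional ratings argument to predict_for_user; a usage sketch with the 0.x API, assuming the ml-latest-small dataset is available locally:

import pandas as pd
from lenskit.datasets import MovieLens
from lenskit.algorithms.bias import Bias

ml = MovieLens('ml-latest-small')
algo = Bias()
algo.fit(ml.ratings)

# Ratings for a user the model has never seen, as an item-indexed Series.
new_ratings = pd.Series({1: 4.0, 32: 3.5, 296: 5.0})
preds = algo.predict_for_user(-1, [260, 318, 527], ratings=new_ratings)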

duplicate ratings give error in item_knn

I used a dataset that contains duplicate ratings (re-ratings). item_knn gave an error on line 329:
assert np.sum(np.logical_not(np.isnan(rate_v))) == len(ri_pos)

Removing the duplicate ratings from the dataset resolved the issue.
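
Until item_knn handles re-ratings itself, a simple pandas workaround is to keep only each user's most recent rating before training:

import pandas as pd

ratings = pd.DataFrame({
    'user': [1, 1, 2],
    'item': [10, 10, 10],
    'rating': [3.0, 4.5, 2.0],   # user 1 re-rated item 10
    'timestamp': [100, 200, 150],
})
deduped = (ratings.sort_values('timestamp')
                  .drop_duplicates(subset=['user', 'item'], keep='last'))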

Run CI on OSX

Extend the CircleCI configuration to test on OSX.

Documentation ALS

The documentation of the ALS algorithm is not clear to me. I needed to look into the code to see that the class is BiasedMF, and that the 50 in the Getting Started example is the number of features.
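
For reference, the pieces the reporter had to dig for, in the 0.x API:

from lenskit.algorithms.als import BiasedMF

algo = BiasedMF(50)  # 50 is the number of latent features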

Add option to drop P for ALS

Once we have #114, there may be use cases in which we always have new ratings. If runtime user training is fast enough, we can save memory by not storing the user-feature matrix at all.

This should be controlled with a save_user_features parameter to the constructors that defaults to True.

Use concurrent instead of joblib for batch parallelism

I would like to use concurrent.futures.ProcessPoolExecutor instead of joblib (+ loky) for batch parallelism, so that we can use initializers instead of hacky cache logic to load shared models. Joblib is great, but we're working around it too much.
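
A sketch of the initializer pattern this would enable; load_model and the model's recommend method are hypothetical stand-ins for the real model-sharing logic:

from concurrent.futures import ProcessPoolExecutor

_model = None  # one copy per worker process

def _init_worker(model_path):
    # Load the shared model once per worker instead of once per task.
    global _model
    _model = load_model(model_path)  # hypothetical loader

def _score_user(user):
    return _model.recommend(user)    # hypothetical model API

def batch_recommend(model_path, users):
    with ProcessPoolExecutor(initializer=_init_worker,
                             initargs=(model_path,)) as pool:
        return list(pool.map(_score_user, users))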

Add recommend-and-measure API

Right now, we only support measuring top-N recommendation lists that have been pre-generated. For unlimited-length lists on large data files, this is infeasible.

We need a 'recommend-and-measure' API that computes metrics over recommendation lists without needing to store the lists in memory. There are two ways we could do this (a rough sketch follows the list):

  • add a method to RecListAnalysis
  • add a batch.recommend_and_measure function
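
Whichever form it takes, the core loop is the same; a rough sketch (all names are hypothetical, since neither option is implemented yet):

def recommend_and_measure(algo, users, n, metrics, truth):
    # Generate each user's list, score it against truth, then discard it,
    # so memory stays bounded by one list at a time.
    results = {}
    for user in users:
        recs = algo.recommend(user, n)
        results[user] = {name: fn(recs, truth[user])
                         for name, fn in metrics.items()}
    return results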

Add generalized neighborhood models

There have been a couple of inquiries on the mailing list for generalized k-NN recommenders that support other similarity functions. Maybe support them?

Implement uniform support for setting random seeds

Best practice for setting random seeds in NumPy is to pass around a Generator or RandomState rather than relying on global seeds. We need to add support to LensKit for doing this in a consistent fashion (see the sketch after this list):

  • Add a random utils module to manage obtaining a generator or random state
  • Allow a global LensKit RNG to be set
  • Add seed / state parameters to every randomized component to support defining seeds
  • Audit use of randomness to make sure all use is consistent
  • Document randomness architecture and requirements for new developers
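
A sketch of what the first two bullets might look like, following NumPy conventions (names hypothetical):

import numpy as np

_global_rng = np.random.default_rng()  # process-wide default

def rng(spec=None):
    # Return a Generator from a seed spec: None uses the global LensKit
    # RNG; an existing Generator passes through; an int or SeedSequence
    # seeds a fresh Generator.
    if spec is None:
        return _global_rng
    if isinstance(spec, np.random.Generator):
        return spec
    return np.random.default_rng(spec)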

Add utility function to log to notebook stderr

It'd be useful to get log output on stderr for progress reporting while running things in a notebook.

We should add a quick function to util that sets up a stderr handler on the lenskit logger (and optionally additional loggers) and test that it makes log output show up in the notebook.
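
A minimal sketch of such a function, using only the standard logging module (the name and signature are hypothetical):

import logging
import sys

def log_to_notebook(level=logging.INFO):
    # Attach a stderr handler to the lenskit logger so log output
    # shows up in the notebook.
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(logging.Formatter('%(asctime)s %(name)s %(levelname)s %(message)s'))
    logger = logging.getLogger('lenskit')
    logger.addHandler(handler)
    logger.setLevel(level)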

Support Dask in RecListAnalysis

The RecListAnalysis class should support analyzing recommendations in Dask data frames. It is less important to support Dask for truth.

Parallelize evaluations

Right now an individual evaluation cannot be parallelized (e.g. across test users). We need a good solution for this.

I would prefer that solution not be per-algorithm, but if that is the only way to do it efficiently, we can consider it.

We can also consider parallelism that only works on *nix. While LKPY should work on Windows, we can have platform-specific performance optimizations.

Add option to forget user ratings or data

Some LensKit components, such as the k-NN recommenders, remember the user-item matrix for the sole purpose of looking up a user's ratings when they arrive.

Others, such as matrix factorization, remember per-user data that can be reconstructed from user ratings.

For online scenarios where user ratings are being re-fetched from the database before each call to predict_for_user or recommend, we should support fit calls that do not remember the ratings matrix, or allow the ratings matrix to be forgotten before saving the model.

Algorithms that should be reviewed for this capability include:

  • ALS Explicit (after #114, covered by #126)
  • ALS Implicit (after #114, covered by #126)
  • SKL SVD

Use UUIDs for run IDs

It would be useful to use UUIDs for run IDs, instead of numbers, so that it's easier to merge results from multiple runs.

We should keep run numbers around for human readability, but make UUIDs the primary identifier.
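
The standard library makes this easy; a sketch of carrying both identifiers:

import uuid

run_num = 1            # human-readable, kept for display
run_id = uuid.uuid4()  # globally unique, safe to merge across runs/machines
record = {'run': run_num, 'run_id': str(run_id)}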

Use Joblib for parallelism

Right now our parallelism is ad-hoc. We should standardize on Joblib, possibly sharing data structures through Plasma.

Getting Started code not executable

It is not possible to use the Getting Started notebook as is. The following line:

fittable = util.clone(algo)

generates the following error:

AttributeError: module 'lenskit.util' has no attribute 'clone'

This statement from the documentation is also incorrect:

"LensKit algorithms are compatible with SciKit clone, however, so feel free to use that if you need more general capabilities."

Attempting to clone LK algorithms with sklearn.base.clone() provides the following error:

TypeError: Cannot clone object '<lenskit.algorithms.item_knn.ItemItem object at 0x61df95c50>' (type <class 'lenskit.algorithms.item_knn.ItemItem'>): it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' methods.
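
Until util.clone lands, one possible workaround (assuming algo is the configured, unfitted algorithm from the notebook) is a plain deep copy:

from copy import deepcopy

fittable = deepcopy(algo)  # stand-in for util.clone on older versions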

Deadlock on Python 3.6 on macOS

The tests do not successfully finish on Python 3.6 on macOS - suspect a deadlock.

This seems to be a deadlock in multiprocessing.Pool, which is known to deadlock in some cases.

Numba dependency incorrect

We are seeing run failures in certain Google Colab environments: a numba typing failure.

Reproducer code:

from lenskit.datasets import MovieLens
from lenskit.algorithms.user_knn import UserUser
from lenskit.algorithms import Recommender

ml = MovieLens('ml-latest-small')
uu = UserUser(30)
uu = Recommender.adapt(uu)
uu.fit(ml.ratings)

Removing the Recommender.adapt call removes the error.

openmp enabled linux wheels with cibuildwheel

(continuing the conversation from Twitter, since I couldn't fit all this in a tweet).

I think the example Travis config file from cibuildwheel will almost work for you; unfortunately I don't know enough about CircleCI to help out with setting that up =(.

I tested building lkpy wheels with cibuildwheel under Docker (I built on OSX, but everything here should work on Linux).

After installing Docker and making sure it is running, the commands to generate the wheels were:

pip install cibuildwheel
git clone git@github.com:lenskit/lkpy.git
cd lkpy

# python 3.7 needs hdf5 headers, this will install them in the docker container
# likewise python 3.5 needed numpy installed first for some reason
export CIBW_BEFORE_BUILD="yum install -y hdf5-devel; pip install numpy"

# I only tested building wheels for python 3.5/3.6
# It doesn't seem like setup.py is valid syntax for python2.7/3.3/3.4 since you are
# relying on this feature  https://www.python.org/dev/peps/pep-0448/ on line 72 which
# was added in python 3.5.
# Python 3.7 didn't work ... but probably can be made to work without too much effort
# setting this environment variable will cause cibuildwheel to skip those versions
export CIBW_SKIP='cp27* cp33* cp34* cp37*'

cibuildwheel --output-dir wheelhouse --platform linux

This dumps out a series of wheel files into a 'wheelhouse' subdirectory:

lkpy ben$ ls wheelhouse/
lenskit-0.1.0-cp35-cp35m-manylinux1_i686.whl
lenskit-0.1.0-cp35-cp35m-manylinux1_x86_64.whl
lenskit-0.1.0-cp36-cp36m-manylinux1_i686.whl
lenskit-0.1.0-cp36-cp36m-manylinux1_x86_64.whl

I took one of them, installed it on an Ubuntu system, and it worked with multi-threaded training; auditwheel didn't flag anything crazy or return an error:


>$ auditwheel show lenskit-0.1.0-cp36-cp36m-manylinux1_x86_64.whl

lenskit-0.1.0-cp36-cp36m-manylinux1_x86_64.whl is consistent with the
following platform tag: "manylinux1_x86_64".

The wheel references external versioned symbols in these system-
provided shared libraries: librt.so.1 with versions {'GLIBC_2.2.5'},
libpthread.so.0 with versions {'GLIBC_2.2.5', 'GLIBC_2.3.4'},
libc.so.6 with versions {'GLIBC_2.3', 'GLIBC_2.2.5', 'GLIBC_2.4'},
libgomp-3300acd3.so.1.0.0 with versions {'GOMP_1.0', 'OMP_1.0'}

The following external shared libraries are required by the wheel:
{
    "libc.so.6": "/lib/x86_64-linux-gnu/libc-2.23.so",
    "libpthread.so.0": "/lib/x86_64-linux-gnu/libpthread-2.23.so",
    "librt.so.1": "/lib/x86_64-linux-gnu/librt-2.23.so"
}

It does seem like cibuildwheel is doing something to install libgomp with your package; the linked libgomp is installed alongside the lenskit package (to site-packages/lenskit/.libs/libgomp...)

>$ ldd ~/anaconda3/lib/python3.6/site-packages/lenskit/algorithms/_item_knn.cpython-36m-x86_64-linux-gnu.so
        linux-vdso.so.1 =>  (0x00007ffcb168d000)
        libgomp-3300acd3.so.1.0.0 => /home/ben/anaconda3/lib/python3.6/site-packages/lenskit/algorithms/../.libs/libgomp-3300acd3.so.1.0.0 (0x00007f349359d000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f3493380000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3492fb6000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f3492dae000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f34939e6000)

The error I saw from Python 3.7 complained about the installed hdf5 version being too old (a dependency of tables, which was getting installed). It shouldn't be too hard to hack around.

I've been mainly using cibuildwheel for Rust extensions so far; I wrote up how to do this here: https://www.benfrederickson.com/writing-python-extensions-in-rust-using-pyo3/. The relevant bits for you are at the end (they cover setting up AppVeyor/Travis to automatically upload wheels to PyPI whenever the version is bumped).

Anyway, hope this helps =) I find all this Python packaging stuff to be a bit of a pain myself...
