lenskit / lkpy

Python recommendation toolkit

Home Page: https://lkpy.lenskit.org

License: MIT License

Languages: Python 99.68%, Just 0.32%
Topics: recsys, python, scikit, lenskit

lkpy's Introduction

Python recommendation tools

LensKit is a set of Python tools for experimenting with and studying recommender systems. It provides support for training, running, and evaluating recommender algorithms in a flexible fashion suitable for research and education.

LensKit for Python (LKPY) is the successor to the Java-based LensKit project.

Important

If you use LensKit for Python in published research, please cite:

Michael D. Ekstrand. 2020. LensKit for Python: Next-Generation Software for Recommender Systems Experiments. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20). DOI:10.1145/3340531.3412778. arXiv:1809.03125 [cs.IR].

Warning

This is the main branch of LensKit, following new development in preparation for the 2024 release. It will be changing frequently and incompatibly. You probably want to use a stable release.

Installing

To install the current release with Anaconda (recommended):

conda install -c conda-forge lenskit

Or you can use pip:

pip install lenskit

To use the latest development version, install directly from GitHub:

pip install -U git+https://github.com/lenskit/lkpy

Then see the Getting Started guide.

Developing

To contribute to LensKit, clone or fork the repository, get to work, and submit a pull request. We welcome contributions from anyone; if you are looking for a place to get started, see the issue tracker.

Our development workflow is documented in the wiki; the wiki also contains other information on developing LensKit. User-facing documentation is at https://lkpy.lenskit.org.

We recommend using an Anaconda environment for developing LensKit. We don't maintain the Conda environment specification directly; instead, we maintain the information in pyproject.toml needed to generate it, so that dependencies and versions are defined in one place.

conda-lock can help you set up the environment; the LensKit build tools automate this.

# install bootstrap environment
conda env create -n lkboot -f https://raw.githubusercontent.com/lenskit/lkbuild/main/boot-env.yml
# create the lock file for Python 3.10
conda run -n lkboot --no-capture lkbuild dev-lock -v 3.10
# create the environment
conda env create -n lkpy -f conda-linux-64.lock

This will create a Conda environment called lkpy with the packages required to develop and test LensKit.

Testing Changes

You should always test your changes by running the LensKit test suite:

python -m pytest

If you want to use your changes in a LensKit experiment, you can locally install your modified LensKit into your experiment's environment. We recommend using separate environments for LensKit development and for each experiment; you will need to install the modified LensKit into your experiment's environment:

conda activate my-exp
conda install -c conda-forge flit
cd /path/to/lkpy
flit install --pth-file --deps none

You may need to first uninstall LensKit from your experiment repo; make sure that LensKit's dependencies are all still installed.

Once you have pushed your code to a GitHub branch, you can use a Git repository as a Pip dependency in an environment.yml for your experiment, to keep using the correct modified version of LensKit until your changes make it into a release.

Resources

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. IIS 17-51278. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

lkpy's People

Contributors

azure-pipelines[bot], carlos10seg, dependabot[bot], keener101, kimuratian, lukas-wegmeth, mdekstrand, reppertj, shwetanshusingh, teej, ziyaowei

lkpy's Issues

Add run-time feature training for ALS

The ALS algorithms are readily amenable to computing new or updated feature vectors for new users on the fly: just solve the least squares problem for the user's feature vector given their ratings and the item feature matrix.

We just need to write the code, and possibly factor out part of the solver to allow single-vector solutions. This is probably most robust with the LU-decomposition solver, although we could run the coordinate descent solver for a few rounds.

Unlike #60, this pertains to what we do at predict or recommend time, not updating the model itself.
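
For concreteness, here is a minimal sketch of that single-vector solve, assuming a dense item-feature matrix Q and an L2 regularization strength; the function name and signature are hypothetical:

import numpy as np

def user_vector(rating_values, rated_items, Q, reg=0.1):
    # Hypothetical helper: solve the regularized least-squares problem
    # for one user's feature vector, given the item-feature matrix Q.
    Qu = Q[rated_items, :]                    # rows for the items this user rated
    A = Qu.T @ Qu + reg * np.eye(Q.shape[1])  # normal equations + ridge term
    b = Qu.T @ rating_values
    return np.linalg.solve(A, b)              # dense LU-based solve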

Implement fit_transform API on Bias

It would make sense for the Bias model to implement transform and fit_transform for bulk-normalizing ratings.

We might want to enable it to support sparse matrices, and for sparse_ratings to support pre-defined user and item indices.
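
A minimal sketch of what transform could do, assuming the 0.x Bias attributes mean_, item_offsets_, and user_offsets_ (a float and two pandas Series), and dense frames only:

import pandas as pd

def transform(bias, ratings: pd.DataFrame) -> pd.DataFrame:
    # Subtract fitted biases from a (user, item, rating) frame (sketch only).
    out = ratings.copy()
    out['rating'] -= bias.mean_                                     # global mean
    out['rating'] -= out['item'].map(bias.item_offsets_).fillna(0)  # item bias
    out['rating'] -= out['user'].map(bias.user_offsets_).fillna(0)  # user bias
    return out

fit_transform would then just be bias.fit(ratings) followed by transform(bias, ratings).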

Support partial_fit API when sensible

We should support the partial_fit API to update recommender models where it makes sense (see the sketch after this list):

  • ALS (#60)
  • Bias models
  • Popularity model
  • User-user
  • Unrated item candidate selector
  • Pass-through on TopN and Fallback
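
The core of a partial_fit for the bias models would be incremental mean updates; a self-contained illustration of that building block (not LensKit API):

import numpy as np

class IncrementalMean:
    # Running mean supporting partial updates - the building block a
    # Bias.partial_fit would need (illustrative only).
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def partial_fit(self, values):
        for x in np.asarray(values, dtype='f8'):
            self.n += 1
            self.mean += (x - self.mean) / self.n  # Welford-style update
        return self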

Codecov migration to marketplace app

Hi, Tom from Codecov here.

We noticed that you are using Codecov with fairly high frequency, and we’re so excited to see that! However, because you are not using our app, you may have experienced issues with uploading reports or viewing coverage information. This is due to rate-limiting issues from GitHub.

In order to prevent any future outages, we ask that you move over to our GitHub app integration.

The process is extremely simple and shouldn’t require more than a few clicks, and you should not expect any downtime. By moving to our app, you will no longer need an admin or separate account to manage the relationship with GitHub as the team bot.

Let me know if you have any questions, or if I can help at all with this process.

Support incremental updates to ALS

Incrementally updating ALS is, in principle, easy:

  1. For a new item, solve for the item feature row.
  2. For a new user, solve for the user feature row.
  3. For new ratings, run a small number of rounds of the solver (maybe just one).

(3) needs a little work - do we need to update all users/items, or just affected ones? We need to store the old ratings matrix, not just the preference matrices, because we need the old data too.

I would propose not updating biases, but letting them get stale.
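
Step 1 mirrors the run-time user solve in #114; a sketch, assuming a dense user-feature matrix P (names hypothetical):

import numpy as np

def fold_in_item(rating_values, rating_users, P, reg=0.1):
    # Solve the regularized least-squares problem for a new item's
    # feature row, given the users who rated it (sketch only).
    Pu = P[rating_users, :]
    A = Pu.T @ Pu + reg * np.eye(P.shape[1])
    return np.linalg.solve(A, Pu.T @ rating_values)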

Set up random seeds in subprocesses

Right now, the random seed infrastructure does not propagate to subprocesses with train_isolated (or to predict/recommend subprocesses, but those are less important).
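
One common pattern, sketched here, is to spawn independent child seeds from a root NumPy SeedSequence and hand one to each subprocess task, so every worker draws from a reproducible, non-overlapping stream:

import numpy as np
from concurrent.futures import ProcessPoolExecutor

def _task(seed_seq):
    rng = np.random.default_rng(seed_seq)  # independent stream per task
    return rng.integers(0, 100)

if __name__ == '__main__':
    root = np.random.SeedSequence(42)
    seeds = root.spawn(8)  # one child seed per task, statistically independent
    with ProcessPoolExecutor() as pool:
        print(list(pool.map(_task, seeds)))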

nprocs seems to cause issues.

nprocs = 12 and nprocs = 24 both caused fatal crashes very quickly into the MultiEval process. We should investigate more closely.

Support new user ratings across LensKit

We need to support new user ratings across LensKit, at least for every algorithm where it makes sense (a usage sketch follows the list):

  • Bias
  • Item-Item
  • UnratedItemsCandidateSelector
  • Popular (indirectly supported through the candidate selector)
  • ALS Explicit (#114)
  • ALS Implicit (#114)
  • SKL SVD
  • FunkSVD - we probably don't actually want to do this, just document that it isn't supported and raise a warning
  • implicit BPR - maybe
  • implicit ALS - maybe
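
For predictors, the existing hook for this is the optional ratings argument to predict_for_user; a usage sketch with the 0.x API, assuming the ml-latest-small dataset is available locally:

import pandas as pd
from lenskit.datasets import MovieLens
from lenskit.algorithms.bias import Bias

ml = MovieLens('ml-latest-small')
algo = Bias()
algo.fit(ml.ratings)

# Ratings for a user the model has never seen, as an item-indexed Series.
new_ratings = pd.Series({1: 4.0, 32: 3.5, 296: 5.0})
preds = algo.predict_for_user(-1, [260, 318, 527], ratings=new_ratings)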

duplicate ratings give error in item_knn

I used a dataset that contains duplicate ratings (re-ratings). item_knn gave an error on line 329:
assert np.sum(np.logical_not(np.isnan(rate_v))) == len(ri_pos)

Removing the duplicate ratings from the dataset resolved the issue.
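
Until item_knn handles re-ratings itself, a simple pandas workaround is to keep only each user's most recent rating before training:

import pandas as pd

ratings = pd.DataFrame({
    'user': [1, 1, 2],
    'item': [10, 10, 10],
    'rating': [3.0, 4.5, 2.0],   # user 1 re-rated item 10
    'timestamp': [100, 200, 150],
})
deduped = (ratings.sort_values('timestamp')
                  .drop_duplicates(subset=['user', 'item'], keep='last'))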

Run CI on OSX

Extend the CircleCI configuration to test on OSX.

Documentation ALS

The documentation of the ALS algorithm is not clear to me. I needed to look into the code to see that the class is BiasedMF, and that the 50 in the Getting Started example is the number of features.
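
For reference, the pieces the reporter had to dig for, in the 0.x API:

from lenskit.algorithms.als import BiasedMF

algo = BiasedMF(50)  # 50 is the number of latent features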

Add option to drop P for ALS

Once we have #114, there may be use cases in which we always have new ratings. If runtime user training is fast enough, we can save memory by not storing the user-feature matrix at all.

This should be controlled with a save_user_features parameter to the constructors that defaults to True.

Use concurrent instead of joblib for batch parallelism

I would like to use concurrent.futures.ProcessPoolExecutor instead of joblib (+ loky) for batch parallelism, so that we can use initializers instead of hacky cache logic to load shared models. Joblib is great, but we're working around it too much.
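
A sketch of the initializer pattern this would enable; load_model and the model's recommend method are hypothetical stand-ins for the real model-sharing logic:

from concurrent.futures import ProcessPoolExecutor

_model = None  # one copy per worker process

def _init_worker(model_path):
    # Load the shared model once per worker instead of once per task.
    global _model
    _model = load_model(model_path)  # hypothetical loader

def _score_user(user):
    return _model.recommend(user)    # hypothetical model API

def batch_recommend(model_path, users):
    with ProcessPoolExecutor(initializer=_init_worker,
                             initargs=(model_path,)) as pool:
        return list(pool.map(_score_user, users))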

Add recommend-and-measure API

Right now, we only support measuring top-N recommendation lists that have been pre-generated. For unlimited-length lists on large data files, this is infeasible.

We need a 'recommend-and-measure' API that computes metrics over recommendation lists without needing to store the lists in memory. There are two ways we could do this (a rough sketch follows the list):

  • add a method to RecListAnalysis
  • add a batch.recommend_and_measure function
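
Whichever form it takes, the core loop is the same; a rough sketch (all names are hypothetical, since neither option is implemented yet):

def recommend_and_measure(algo, users, n, metrics, truth):
    # Generate each user's list, score it against truth, then discard it,
    # so memory stays bounded by one list at a time.
    results = {}
    for user in users:
        recs = algo.recommend(user, n)
        results[user] = {name: fn(recs, truth[user])
                         for name, fn in metrics.items()}
    return results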

Add generalized neighborhood models

There have been a couple of inquiries on the mailing list for generalized k-NN recommenders that support other similarity functions. Maybe support them?

Implement uniform support for setting random seeds

Best practice for setting random seeds in NumPy is to pass around a Generator or RandomState rather than relying on global seeds. We need to add support to LensKit for doing this in a consistent fashion (see the sketch after this list):

  • Add a random utils module to manage obtaining a generator or random state
  • Allow a global LensKit RNG to be set
  • Add seed / state parameters to every randomized component to support defining seeds
  • Audit use of randomness to make sure all use is consistent
  • Document randomness architecture and requirements for new developers
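
A sketch of what the first two bullets might look like, following NumPy conventions (names hypothetical):

import numpy as np

_global_rng = np.random.default_rng()  # process-wide default

def rng(spec=None):
    # Return a Generator from a seed spec: None uses the global LensKit
    # RNG; an existing Generator passes through; an int or SeedSequence
    # seeds a fresh Generator.
    if spec is None:
        return _global_rng
    if isinstance(spec, np.random.Generator):
        return spec
    return np.random.default_rng(spec)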

Add utility function to log to notebook stderr

It'd be useful to get log output on stderr for progress reporting while running things in a notebook.

We should add a quick function to util that sets up a stderr handler on the lenskit logger (and optionally additional loggers) and test that it makes log output show up in the notebook.
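
A minimal sketch of such a function, using only the standard logging module (the name and signature are hypothetical):

import logging
import sys

def log_to_notebook(level=logging.INFO):
    # Attach a stderr handler to the lenskit logger so log output
    # shows up in the notebook.
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(logging.Formatter('%(asctime)s %(name)s %(levelname)s %(message)s'))
    logger = logging.getLogger('lenskit')
    logger.addHandler(handler)
    logger.setLevel(level)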

Support Dask in RecListAnalysis

The RecListAnalysis class should support analyzing recommendations in Dask data frames. It is less important to support Dask for truth.

Parallelize evaluations

Right now an individual evaluation cannot be parallelized (e.g. across test users). We need a good solution for this.

I would prefer that solution not be per-algorithm, but if that is the only way to do it efficiently, we can consider it.

We can also consider parallelism that only works on *nix. While LKPY should work on Windows, we can have platform-specific performance optimizations.

Add option to forget user ratings or data

Some LensKit components, such as the k-NN recommenders, remember the user-item matrix for the sole purpose of looking up a user's ratings when they arrive.

Others, such as matrix factorization, remember per-user data that can be reconstructed from user ratings.

For online scenarios where user ratings are being re-fetched from the database before each call to predict_for_user or recommend, we should support fit calls that do not remember the ratings matrix, or allow the ratings matrix to be forgotten before saving the model.

Algorithms that should be reviewed for this capability include:

  • ALS Explicit (after #114, covered by #126)
  • ALS Implicit (after #114, covered by #126)
  • SKL SVD

Use UUIDs for run IDs

It would be useful to use UUIDs for run IDs, instead of numbers, so that it's easier to merge results from multiple runs.

We should keep run numbers around for human readability, but make UUIDs the primary identifier.
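
The standard library makes this easy; a sketch of carrying both identifiers:

import uuid

run_num = 1            # human-readable, kept for display
run_id = uuid.uuid4()  # globally unique, safe to merge across runs/machines
record = {'run': run_num, 'run_id': str(run_id)}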

Use Joblib for parallelism

Right now our parallelism is ad-hoc. We should standardize on Joblib, possibly sharing data structures through Plasma.

Getting Started code not executable

It is not possible to use the Getting Started notebook as is. The following line:

fittable = util.clone(algo)

generates the following error:

AttributeError: module 'lenskit.util' has no attribute 'clone'

This statement from the documentation is also incorrect:

"LensKit algorithms are compatible with SciKit clone, however, so feel free to use that if you need more general capabilities."

Attempting to clone LK algorithms with sklearn.base.clone() provides the following error:

TypeError: Cannot clone object '<lenskit.algorithms.item_knn.ItemItem object at 0x61df95c50>' (type <class 'lenskit.algorithms.item_knn.ItemItem'>): it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' methods.
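
Until util.clone lands, one possible workaround (assuming algo is the configured, unfitted algorithm from the notebook) is a plain deep copy:

from copy import deepcopy

fittable = deepcopy(algo)  # stand-in for util.clone on older versions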

Deadlock on Python 3.6 on macOS

The tests do not successfully finish on Python 3.6 on macOS - suspect a deadlock.

This seems to be a deadlock in multiprocessing.Pool, which is known to deadlock in some cases.

Numba dependency incorrect

We are seeing run failures in certain Google Colab environments: a numba typing failure.

Reproducer code:

from lenskit.datasets import MovieLens
from lenskit.algorithms.user_knn import UserUser
from lenskit.algorithms import Recommender

ml = MovieLens('ml-latest-small')
uu = UserUser(30)
uu = Recommender.adapt(uu)
uu.fit(ml.ratings)

Removing the Recommender.adapt call removes the error.

openmp enabled linux wheels with cibuildwheel

(continuing the conversation from Twitter, since I couldn't fit all this in a tweet).

I think the example Travis config file from cibuildwheel will almost work for you; unfortunately I don't know enough about CircleCI to help out with setting that up =(.

I tested building lkpy wheels with cibuildwheel under Docker (I built on OSX, but everything here should work on Linux).

After installing Docker and making sure it is running, the commands to generate the wheels were:

pip install cibuildwheel
git clone git@github.com:lenskit/lkpy.git
cd lkpy

# python 3.7 needs hdf5 headers, this will install them in the docker container
# likewise python 3.5 needed numpy installed first for some reason
export CIBW_BEFORE_BUILD="yum install -y hdf5-devel; pip install numpy"

# I only tested building wheels for python 3.5/3.6
# It doesn't seem like setup.py is valid syntax for python2.7/3.3/3.4 since you are
# relying on this feature  https://www.python.org/dev/peps/pep-0448/ on line 72 which
# was added in python 3.5.
# Python 3.7 didn't work ... but probably can be made to work without too much effort
# setting this environment variable will cause cibuildwheel to skip those versions
export CIBW_SKIP='cp27* cp33* cp34* cp37*'

cibuildwheel --output-dir wheelhouse --platform linux

This dumps out a series of wheel files into a 'wheelhouse' subdirectory:

lkpy ben$ ls wheelhouse/
lenskit-0.1.0-cp35-cp35m-manylinux1_i686.whl
lenskit-0.1.0-cp35-cp35m-manylinux1_x86_64.whl
lenskit-0.1.0-cp36-cp36m-manylinux1_i686.whl
lenskit-0.1.0-cp36-cp36m-manylinux1_x86_64.whl

I took one of them, installed it on an Ubuntu system, and it worked with multi-threaded training; auditwheel didn't flag anything crazy or return an error:


>$ auditwheel show lenskit-0.1.0-cp36-cp36m-manylinux1_x86_64.whl

lenskit-0.1.0-cp36-cp36m-manylinux1_x86_64.whl is consistent with the
following platform tag: "manylinux1_x86_64".

The wheel references external versioned symbols in these system-
provided shared libraries: librt.so.1 with versions {'GLIBC_2.2.5'},
libpthread.so.0 with versions {'GLIBC_2.2.5', 'GLIBC_2.3.4'},
libc.so.6 with versions {'GLIBC_2.3', 'GLIBC_2.2.5', 'GLIBC_2.4'},
libgomp-3300acd3.so.1.0.0 with versions {'GOMP_1.0', 'OMP_1.0'}

The following external shared libraries are required by the wheel:
{
    "libc.so.6": "/lib/x86_64-linux-gnu/libc-2.23.so",
    "libpthread.so.0": "/lib/x86_64-linux-gnu/libpthread-2.23.so",
    "librt.so.1": "/lib/x86_64-linux-gnu/librt-2.23.so"
}

It does seem like cibuildwheel is doing something to install libgomp with your package; the linked libgomp is installed alongside the lenskit package (to site-packages/lenskit/.libs/libgomp...)

>$ ldd ~/anaconda3/lib/python3.6/site-packages/lenskit/algorithms/_item_knn.cpython-36m-x86_64-linux-gnu.so
        linux-vdso.so.1 =>  (0x00007ffcb168d000)
        libgomp-3300acd3.so.1.0.0 => /home/ben/anaconda3/lib/python3.6/site-packages/lenskit/algorithms/../.libs/libgomp-3300acd3.so.1.0.0 (0x00007f349359d000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f3493380000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3492fb6000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f3492dae000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f34939e6000)

The error I saw from Python 3.7 complained about the installed hdf5 version being too old (a dependency of tables, which was getting installed). It shouldn't be too hard to hack around.

I've been mainly using cibuildwheel for Rust extensions so far; I wrote up how to do this here: https://www.benfrederickson.com/writing-python-extensions-in-rust-using-pyo3/. The relevant bits for you are at the end (they cover setting up AppVeyor/Travis to automatically upload wheels to PyPI whenever the version is bumped).

Anyway, hope this helps =) I find all this Python packaging stuff to be a bit of a pain myself...
