
sberbank-ai-lab / replay

RecSys Library

Home Page: https://sberbank-ai-lab.github.io/RePlay/

License: Apache License 2.0

Shell 0.06% Python 99.94%
machine-learning recsys recommender-systems pyspark pytorch

replay's Introduction

RePlay

RePlay is a library providing tools for all stages of creating a recommendation system, from data preprocessing to model evaluation and comparison.

RePlay uses PySpark to handle big data.

You can

  • Filter and split data
  • Train models
  • Optimize hyperparameters
  • Evaluate predictions with metrics
  • Combine predictions from different models
  • Create a two-level model

Docs

Documentation

Installation

Use a Linux machine with Python 3.7+, Java 8+ and a C++ compiler.

pip install replay-rec

It is preferable to use a virtual environment for your installation.

If you encounter an error during RePlay installation, check the troubleshooting guide.

replay's People

Contributors

alexxl1986, alside, darel13712, monkey0head, roseaysina, shashist, shminke-ba

replay's Issues

cannot import name 'JavaClassificationModel' from 'pyspark.ml.classification'

  • Ubuntu 18.04.3 LTS (Bionic Beaver)
  • Python 3.8.5
  • pyspark==3.2.0
  • replay-rec==0.6.0

An exception occurs when I try to import RePlay models:

File "/home/rinchin/main.py", line 7, in <module>
    from replay.models import PopRec
  File "/home/rinchin/replay_venv/lib/python3.8/site-packages/replay/models/__init__.py", line 15, in <module>
    from replay.models.classifier_rec import ClassifierRec
  File "/home/rinchin/replay_venv/lib/python3.8/site-packages/replay/models/classifier_rec.py", line 26, in <module>
    from pyspark.ml.classification import (
ImportError: cannot import name 'JavaClassificationModel' from 'pyspark.ml.classification' 

NeuroMF bug

After the indexing change, negative sampling in NeuroMF has a bug.
Before, negatives were sampled uniformly from [0, num_items); now they are sampled from [0, max_item + 1).
Item ids used to be consecutive and now they are not, so we can sample absent ids, which can break training or make it less efficient.
Here:

    def _get_neg_batch(self, batch: Tensor) -> Tensor:
        negative_items = torch.randint(
            0,
            self._item_dim,
            (batch.shape[0] * self.count_negative_sample,),
        )
        return negative_items
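
One possible direction, sketched here in plain Python rather than torch so that it stays self-contained: sample positions in the list of ids that actually exist instead of sampling raw id values. The function name `sample_negatives` and the example ids are hypothetical and are not part of RePlay.

```python
import random
from typing import List, Sequence


def sample_negatives(
    known_item_ids: Sequence[int],
    batch_size: int,
    negatives_per_positive: int,
    rng: random.Random,
) -> List[int]:
    """Draw negatives uniformly from ids that actually exist.

    Sampling from range(max_item + 1) can return ids that no item has
    when the ids are not consecutive; sampling a position in the list of
    known ids avoids that.
    """
    count = batch_size * negatives_per_positive
    return [
        known_item_ids[rng.randrange(len(known_item_ids))]
        for _ in range(count)
    ]


# Hypothetical non-consecutive ids: 1, 3, 4 and 6 are absent.
item_ids = [0, 2, 5, 7]
negatives = sample_negatives(
    item_ids, batch_size=3, negatives_per_positive=2, rng=random.Random(0)
)
```

In the torch version the same idea would mean drawing indices in [0, len(item_ids)) and using them to index a tensor of known ids.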

Make indexer skip renaming

Indexer will fail if columns are user_idx and item_idx, because it can't rename them to the same value.

Create indexer save/load functions

With indexers extracted, we now need separate functions to save and load an indexer. Something like:

from os.path import join

from pyspark.ml.feature import IndexToString, StringIndexerModel

from replay.data_preparator import Indexer  # import path is an assumption


def save_index(indexer: Indexer, path: str) -> None:
    # Persist the four fitted Spark transformers that make up the indexer.
    # user_type / item_type (the original column dtypes) would need to be
    # stored as well; that part is omitted in this sketch.
    indexer.user_indexer.save(join(path, "user_indexer"))
    indexer.item_indexer.save(join(path, "item_indexer"))
    indexer.inv_user_indexer.save(join(path, "inv_user_indexer"))
    indexer.inv_item_indexer.save(join(path, "inv_item_indexer"))


def load_index(path: str) -> Indexer:
    # Rebuild an Indexer from the transformers written by save_index.
    indexer = Indexer()
    indexer.user_indexer = StringIndexerModel.load(join(path, "user_indexer"))
    indexer.item_indexer = StringIndexerModel.load(join(path, "item_indexer"))
    indexer.inv_user_indexer = IndexToString.load(join(path, "inv_user_indexer"))
    indexer.inv_item_indexer = IndexToString.load(join(path, "inv_item_indexer"))
    return indexer

LightFM indexer bugs

  • test_predict cannot perform fit on this data; it fails with {Exception}Number of user feature rows does not equal the number of users. The problem is probably that the user features do not contain all the users in the log. We should investigate what is happening in _feature_table_to_csr. Maybe we should inner join the log and the features?
  • test_predict_pairs and test_enrich_with_features probably fail on fit too

fallback scenario indexers issue

In addition to #68, there is probably a bug in the fallback scenario's indexers.
They seem to be one object, but they need to be different objects for cold users and items to be processed properly.
The length of the scenario's user indexer labels is not equal to the number of users in the train dataset. Also, new users from the test set were not included in the indexers (at least the PopRec indexers inside Fallback), but they should be.
I believe the length of the user indexer for fb_model in the scenario should be 6040, as for a pure PopRec model.

word2vec

Estimate the impact of the maxIter parameter in the word2vec model.

models tests

Add a test for every model that checks that the model:

  • returns predictions for warm users
  • returns predictions for cold users (if it is able to predict for cold users), or does not fail and returns nothing for them
  • returns predictions for new users (if it is able to predict for new users), or does not fail and returns nothing for them

fallback scenario bug

The Fallback scenario does not return enough recommendations.
The fallback function from utils works well.

It is probably a bug inside Fallback._predict, after we get recommendations from the models' _predict. These recommendations still contain seen items and are not cropped to the top K, because both actions are performed inside _predict_wrap. So fallback is applied to the recommendations before the seen items are filtered out and the list is cropped.
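
The ordering problem can be illustrated with a small plain-Python sketch: filter seen items and crop to the top K first, and only then fill the remaining slots from the fallback model. The names `top_k_unseen` and `fallback_fill` are hypothetical, and the real scenario works on Spark DataFrames, so this only shows the intended order of operations.

```python
from typing import List, Set, Tuple

Scored = Tuple[str, float]  # (item_id, relevance)


def top_k_unseen(recs: List[Scored], seen: Set[str], k: int) -> List[Scored]:
    """Drop seen items, then keep the k most relevant ones."""
    unseen = [(item, score) for item, score in recs if item not in seen]
    return sorted(unseen, key=lambda pair: pair[1], reverse=True)[:k]


def fallback_fill(
    main: List[Scored], fallback: List[Scored], seen: Set[str], k: int
) -> List[Scored]:
    """Filter and crop both lists first, then fill the gaps from the fallback."""
    result = top_k_unseen(main, seen, k)
    taken = {item for item, _ in result}
    for item, score in top_k_unseen(fallback, seen, k):
        if len(result) >= k:
            break
        if item not in taken:
            result.append((item, score))
            taken.add(item)
    return result


main_recs = [("a", 0.9), ("b", 0.8), ("c", 0.7)]
fallback_recs = [("a", 5.0), ("d", 4.0), ("e", 3.0), ("c", 2.0)]
recs = fallback_fill(main_recs, fallback_recs, seen={"a", "b"}, k=3)
```

Here the main model has only one unseen item left after filtering, so two items come from the fallback. The scores of the two models are not rescaled in this sketch; main-model items simply come first.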


SecondLevelFeaturesProcessor refactoring

  • create base class with fit - transform interface
  • create separate classes for processing stat features (based on the log) and conditional popularity features
  • add smoothing

`DataPreparator` bug when using user_features and/or item_features

DataPreparator fails to use Indexer.(user|item)_indexer when called with user_features and/or item_features, because Indexer.(user|item)_indexer expects Indexer.fit() to be called before Indexer.transform().

How to reproduce

import pandas as pd

from replay.data_preparator import DataPreparator


df = pd.read_csv(
    "experiments/data/ml1m_ratings.dat", 
    sep="\t", 
    names=["user_id", "item_id", "relevance", "timestamp"]
)
users = pd.read_csv(
    "experiments/data/ml1m_users.dat", 
    sep="\t", 
    names=["user_id", "gender", "age", "occupation", "zip_code"]
)

data_preparator = DataPreparator()
log, user_features, _ = data_preparator(df, users)

Expected behavior

Probably Indexer.fit() should be used earlier in DataPreparator.__call__()

Rework Fallback Scenario

We now have a Cold model inside the base scenario. It should be extracted and merged with the Fallback scenario.

test_all_models indexer bug

test_predict_pairs_warm_only multvae fails with IndexError: index 4 is out of bounds for dimension 0 with size 4

check two-level scenario

  • check for bugs/errors
  • create jupyter notebook illustrating two-level scenario usage and optimisation

update docs structure

The Useful info page should not be a subpage of Modules; it should be a top-level page.

Also, the definitions of cold users and items in the tables in algorithm selection should be more detailed. The first table refers to new users with few interactions available, and the second refers to completely cold users without any history at all.

MultVAE bug in predict for cold users

There is a bug in MultVAE that leads to a weird-looking error during predict for a cold user:

File "/home/volodkevich/replay_tasks/current_base/replay/models/mult_vae.py", line 312, in _predict_pairs_inner
    user_batch[0, items_np_history] = 1
IndexError: index -9223372036854775808 is out of bounds for dimension 0 with size 3646

items_np_history is an empty list for a cold user, so this code fails. But we do not actually want the model to run predict for cold users, so we need to delete the unnecessary code and filter users before applyInPandas.
Prediction works for warm users and fails for cold users.

Use arbitrary relevances for KNN and PopRecs

The current implementations of KNN, PopRec, UserpopRec and RandomRec ignore relevance values and treat them all as 1.
A new __init__ parameter should be added to support arbitrary relevance weights.
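
As a rough illustration of what such a switch could mean for a popularity model, the sketch below counts interactions by default and sums relevance values when a flag is set. The function `item_popularity` and the flag name `use_relevance` are hypothetical; this is plain Python over tuples, not the RePlay implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Interaction = Tuple[str, str, float]  # (user_id, item_id, relevance)


def item_popularity(
    log: Iterable[Interaction], use_relevance: bool = False
) -> Dict[str, float]:
    """Score items by interaction count, or by summed relevance if requested."""
    totals: Dict[str, float] = defaultdict(float)
    for _user, item, relevance in log:
        # With use_relevance=False every interaction counts as 1.
        totals[item] += relevance if use_relevance else 1.0
    return dict(totals)


log = [
    ("u1", "i1", 5.0),
    ("u2", "i1", 1.0),
    ("u1", "i2", 4.0),
    ("u3", "i2", 4.0),
    ("u3", "i3", 2.0),
]
counts = item_popularity(log)
weighted = item_popularity(log, use_relevance=True)
```

With counts, i1 and i2 are tied; with relevance weights, i2 ranks above i1, which is the kind of difference the new parameter would expose.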

Python 3.6

Python==3.9.7
poetry==1.1.11
pip==21.3.1

pyproject.toml looks like this:

[tool.poetry]
name = "replay-rec"
version = "0.7.0"
description = "RecSys Library"
authors = [""]

[tool.poetry.dependencies]
python = ">=3.6.2, <3.10"
lightautoml = ">=0.3.1"
numpy = [
  {version = ">=1.20.0", python = ">=3.7"},
  {version = "*", python = "<3.7"}
]

poetry lock cannot complete and fails with the following message:

The current project's Python requirement (>=3.6.2,<3.10) is not compatible with some of the required packages Python requirement:
    - numpy requires Python >=3.7,<3.11, so it will not be satisfied for Python >=3.6.2,<3.7
    - numpy requires Python >=3.7, so it will not be satisfied for Python >=3.6.2,<3.7
    - numpy requires Python >=3.7, so it will not be satisfied for Python >=3.6.2,<3.7
    - numpy requires Python >=3.7, so it will not be satisfied for Python >=3.6.2,<3.7
    - numpy requires Python >=3.7, so it will not be satisfied for Python >=3.6.2,<3.7
    - numpy requires Python >=3.7, so it will not be satisfied for Python >=3.6.2,<3.7
    - numpy requires Python >=3.7, so it will not be satisfied for Python >=3.6.2,<3.7
    - numpy requires Python >=3.7,<3.11, so it will not be satisfied for Python >=3.6.2,<3.7
    - numpy requires Python >=3.7,<3.11, so it will not be satisfied for Python >=3.6.2,<3.7

  Because no versions of numpy match >1.20.0,<1.20.1 || >1.20.1,<1.20.2 || >1.20.2,<1.20.3 || >1.20.3,<1.21.0 || >1.21.0,<1.21.1 || >1.21.1,<1.21.2 || >1.21.2,<1.21.3 || >1.21.3,<1.21.4 || >1.21.4
   and numpy (1.20.0) requires Python >=3.7, numpy is forbidden.
  And because numpy (1.20.1) requires Python >=3.7
   and numpy (1.20.2) requires Python >=3.7, numpy is forbidden.
  And because numpy (1.20.3) requires Python >=3.7
   and numpy (1.21.0) requires Python >=3.7, numpy is forbidden.
  And because numpy (1.21.1) requires Python >=3.7
   and numpy (1.21.2) requires Python >=3.7,<3.11, numpy is forbidden.
  And because numpy (1.21.3) requires Python >=3.7,<3.11
   and numpy (1.21.4) requires Python >=3.7,<3.11, numpy is forbidden.
  Because no versions of lightautoml match >0.3.1
   and lightautoml (0.3.1) depends on numpy (>=1.20.0), lightautoml (>=0.3.1) requires numpy (>=1.20.0).
  Thus, lightautoml is forbidden.
  So, because replay-rec depends on lightautoml (>=0.3.1), version solving failed.

  at ~/python/.e397/lib/python3.9/site-packages/poetry/puzzle/solver.py:241 in _solve
      237│             packages = result.packages
      238│         except OverrideNeeded as e:
      239│             return self.solve_in_compatibility_mode(e.overrides, use_latest=use_latest)
      240│         except SolveFailure as e:
    → 241│             raise SolverProblemError(e)
      242│
      243│         results = dict(
      244│             depth_first_search(
      245│                 PackageNode(self._package, packages), aggregate_package_nodes

  • Check your dependencies Python requirement: The Python requirement can be specified via the `python` or `markers` properties

    For numpy, a possible solution would be to set the `python` property to ">=3.7,<3.10"
    For numpy, a possible solution would be to set the `python` property to ">=3.7,<3.10"
    For numpy, a possible solution would be to set the `python` property to ">=3.7,<3.10"
    For numpy, a possible solution would be to set the `python` property to ">=3.7,<3.10"
    For numpy, a possible solution would be to set the `python` property to ">=3.7,<3.10"
    For numpy, a possible solution would be to set the `python` property to ">=3.7,<3.10"
    For numpy, a possible solution would be to set the `python` property to ">=3.7,<3.10"
    For numpy, a possible solution would be to set the `python` property to ">=3.7,<3.10"
    For numpy, a possible solution would be to set the `python` property to ">=3.7,<3.10"

    https://python-poetry.org/docs/dependency-specification/#python-restricted-dependencies,
    https://python-poetry.org/docs/dependency-specification/#using-environment-markers

"optimize" refactoring

  • we use both _search_space and param_grid in the optuna objective, which makes the code unclear
  • we fix the value of some parameters in _search_space (e.g. latent_dim in MultVAE), and the behaviour of optimize is unclear when the fixed value differs from the value the user set at model initialisation

We need to refactor the code and check that optimize, when called with an empty param_grid, does not change user-defined parameters that are not included in param_grid or are fixed in _search_space.

Exponential time smoothing

We can add some time-awareness to models by applying time-dependent weights to relevance values. This weighting can happen at three places in the recommendation process:

  1. Before model training
  2. At prediction time before get_top_k method
  3. After prediction, as a way to rerank final recommendations

Regardless of which option we choose (or whether we support all of them), we need functions that calculate these weights.

Arguments should include

  • decay: the "half-life" of a weight, i.e. the number of days after which the weight is reduced by 50%. Probably a float.
  • limit: the minimal value the weight can reach, to avoid zeroing out very old interactions.

There are two options for calculating weights: per interaction and per item.

Both take a log with timestamp values as input, but their return values differ.

Item weights return a new DataFrame mapping item_id to weight.
Interaction weights modify the log's relevance values in place.
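
A minimal sketch of the weight calculation, assuming the weight halves every `decay` days and is floored at `limit`. It uses plain Python tuples instead of a Spark DataFrame and returns a new list instead of modifying the log in place. The function names, the default `limit=0.1`, and the choice to base an item's weight on its most recent interaction are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple

# (user_id, item_id, relevance, timestamp); a plain-Python stand-in for the log
Interaction = Tuple[str, str, float, datetime]

SECONDS_PER_DAY = 86400.0


def time_weight(age_days: float, decay: float, limit: float = 0.1) -> float:
    """Halve the weight every `decay` days, but never go below `limit`."""
    return max(limit, 0.5 ** (age_days / decay))


def _age_days(timestamp: datetime, now: datetime) -> float:
    return (now - timestamp).total_seconds() / SECONDS_PER_DAY


def interaction_weights(
    log: List[Interaction],
    decay: float,
    limit: float = 0.1,
    now: Optional[datetime] = None,
) -> List[Interaction]:
    """Return a copy of the log with each relevance scaled by its time weight."""
    if now is None:
        now = max(row[3] for row in log)
    return [
        (user, item, relevance * time_weight(_age_days(ts, now), decay, limit), ts)
        for user, item, relevance, ts in log
    ]


def item_weights(
    log: List[Interaction],
    decay: float,
    limit: float = 0.1,
    now: Optional[datetime] = None,
) -> Dict[str, float]:
    """Map each item to the weight of its most recent interaction (an assumption)."""
    if now is None:
        now = max(row[3] for row in log)
    latest: Dict[str, datetime] = {}
    for _user, item, _relevance, ts in log:
        if item not in latest or ts > latest[item]:
            latest[item] = ts
    return {
        item: time_weight(_age_days(ts, now), decay, limit)
        for item, ts in latest.items()
    }


now = datetime(2021, 11, 1)
log = [
    ("u1", "i1", 4.0, now),
    ("u2", "i1", 2.0, now - timedelta(days=30)),
    ("u1", "i2", 5.0, now - timedelta(days=60)),
]
weighted_log = interaction_weights(log, decay=30.0, now=now)
weights_by_item = item_weights(log, decay=30.0, now=now)
```

With a 30-day half-life, an interaction that is 30 days old keeps half of its relevance and one that is 60 days old keeps a quarter; `limit` only takes effect for much older interactions.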

Neuromf indexer bug

test_predict fails on predict, at spark.apply in pandas inside _predict_by_user. Predicting by hand for each user works.

RePlay installation failed on Python 3.8.12

I tried to install the RePlay library for Python 3.8.12 with the pip install replay-rec command in a new virtual env. Pip was updated to version 21.3.1 beforehand.

The main problem was with implicit, which tried to find NVCC and CUDA on a CPU-only machine.

Any ideas?

check notebooks

  • notebooks work
  • notebooks are up to date
  • keep only one of the compare_model notebooks
  • check whether ml_ratings.csv is used in the notebooks and drop it if it is not

check licences of libraries

Find usages of "viral" (copyleft) licences in RePlay. If a library with such a licence is used at runtime, remove it.
Viral licences:

  • GPL
  • LGPL
  • MPL

KNN optimize

KNN should have its own optuna objective.

It should calculate neighbours once for the maximum search_space value and then just take the top k neighbours when optuna suggests a parameter value.
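
The idea can be sketched in plain Python as follows: rank neighbours once up to the largest k in the search space, then slice the cached lists for every trial. The names `fit_neighbours` and `truncate_neighbours` and the toy similarity values are hypothetical; inside a real optuna objective, k would come from the trial instead of a loop.

```python
from typing import Dict, List, Tuple

# item -> [(neighbour, similarity), ...] sorted by decreasing similarity
Neighbours = Dict[str, List[Tuple[str, float]]]


def fit_neighbours(similarity: Dict[str, Dict[str, float]], max_k: int) -> Neighbours:
    """Sort each item's neighbours once and keep the max_k most similar."""
    return {
        item: sorted(sims.items(), key=lambda pair: pair[1], reverse=True)[:max_k]
        for item, sims in similarity.items()
    }


def truncate_neighbours(neighbours: Neighbours, k: int) -> Neighbours:
    """Reuse the precomputed lists for any k <= max_k without refitting."""
    return {item: ranked[:k] for item, ranked in neighbours.items()}


similarity = {
    "a": {"b": 0.9, "c": 0.2, "d": 0.5},
    "b": {"a": 0.9, "c": 0.4, "d": 0.1},
}
cached = fit_neighbours(similarity, max_k=3)  # the expensive step, done once
per_trial = {k: truncate_neighbours(cached, k) for k in (1, 2, 3)}
```

Because the cached lists are already sorted, every trial costs only a slice instead of a full neighbour computation.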

FirstLevelFeaturesProcessor refactoring

  • rename the class so that its name is meaningful
  • look at the pyspark mllib transformer/estimator interface and decide whether to inherit from it
  • evaluate the code logic and change it if necessary

`test_utils` fails

test_process_timestamp fails (probably a timezone problem):

E   AssertionError: DataFrame.iloc[:, 0] (column name="str_") are different
E   
E   DataFrame.iloc[:, 0] (column name="str_") values are different (100.0 %)
E   [left]:  [2021-08-22T00:00:00.000000000, 2021-08-23T11:29:29.000000000, 2021-08-27T00:00:00.000000000]
E   [right]: [2021-08-21T22:00:00.000000000, 2021-08-23T09:29:29.000000000, 2021-08-26T22:00:00.000000000]
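
A quick check with the standard library supports the timezone guess: every pair of values in the assertion differs by exactly two hours, which is what converting local time in a UTC+2 zone to UTC would produce. The UTC+2 zone is an assumption inferred from the numbers, and the nanosecond fractions are trimmed from the strings for parsing.

```python
from datetime import datetime, timedelta, timezone

left = ["2021-08-22T00:00:00", "2021-08-23T11:29:29", "2021-08-27T00:00:00"]
right = ["2021-08-21T22:00:00", "2021-08-23T09:29:29", "2021-08-26T22:00:00"]

# Collect the distinct differences between expected and actual values.
offsets = {
    datetime.fromisoformat(a) - datetime.fromisoformat(b)
    for a, b in zip(left, right)
}

# Midnight in a UTC+2 zone, converted to UTC, lands on the previous day at 22:00.
local_midnight = datetime(2021, 8, 22, tzinfo=timezone(timedelta(hours=2)))
as_utc = local_midnight.astimezone(timezone.utc).replace(tzinfo=None)
```

A constant offset like this usually means one side of the comparison was parsed in the machine's local zone and the other in UTC.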

create CandidatesGenerator

Create a class that wraps RePlay models for candidate generation and:

  • preprocesses the log and features
  • trains models
  • generates candidates from each model or from selected models
  • gets the fallback prediction and ensembles the candidates into one dataset
  • adds relevance/rank from each model
  • adds features from selected models
  • adds a "generated_by" column for each model?

changes in metrics

  • keep only the top-k recommendations before aggregating them into a list
  • remove the joining and grouping of results by k when a metric is calculated for different K
