sberbank-ai-lab / replay

RecSys Library

Home Page: https://sberbank-ai-lab.github.io/RePlay/

License: Apache License 2.0

Shell 0.06% Python 99.94%

machine-learning recsys recommender-systems pyspark pytorch

replay's Introduction

RePlay

RePlay is a library providing tools for all stages of creating a recommendation system, from data preprocessing to model evaluation and comparison.

RePlay uses PySpark to handle big data.

You can

Filter and split data
Train models
Optimize hyper parameters
Evaluate predictions with metrics
Combine predictions from different models
Create a two-level model

Docs

Documentation

Installation

Use a Linux machine with Python 3.7+, Java 8+, and a C++ compiler.

pip install replay-rec

It is preferable to use a virtual environment for your installation.

If you encounter an error during RePlay installation, check the troubleshooting guide.

replay's People

dev-rinchin mindis lordzeee mindaugaszickus strogo dimkoss11

replay's Issues

cannot import name 'JavaClassificationModel' from 'pyspark.ml.classification'

Ubuntu 18.04.3 LTS (Bionic Beaver)
Python 3.8.5
pyspark==3.2.0
replay-rec==0.6.0

An exception occurs when I try to import RePlay models:

File "/home/rinchin/main.py", line 7, in <module>
    from replay.models import PopRec
  File "/home/rinchin/replay_venv/lib/python3.8/site-packages/replay/models/__init__.py", line 15, in <module>
    from replay.models.classifier_rec import ClassifierRec
  File "/home/rinchin/replay_venv/lib/python3.8/site-packages/replay/models/classifier_rec.py", line 26, in <module>
    from pyspark.ml.classification import (
ImportError: cannot import name 'JavaClassificationModel' from 'pyspark.ml.classification'

NeuroMF bug

After the indexing change, negative sampling for NeuroMF has a bug.
Before, negatives were sampled uniformly from [0, num_items); now they are sampled from [0, max_item + 1).
Ids used to be consecutive and now they are not, so we can sample absent ids, which can break training or make it less efficient.
Here:

    def _get_neg_batch(self, batch: Tensor) -> Tensor:
        negative_items = torch.randint(
            0,
            self._item_dim,
            (batch.shape[0] * self.count_negative_sample,),
        )
        return negative_items
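
One way to avoid sampling absent ids is to draw negatives from the ids that actually exist rather than from a contiguous range. The sketch below is plain Python for illustration only: `sample_negatives` and `known_item_ids` are hypothetical names, not part of RePlay, and a torch version would instead pick random positions in a tensor of known ids.

```python
import random
from typing import List, Sequence


def sample_negatives(
    known_item_ids: Sequence[int],
    batch_size: int,
    count_negative_sample: int,
    rng: random.Random,
) -> List[int]:
    """Draw negatives uniformly from ids that exist, not from [0, max_item + 1)."""
    # Sample positions in [0, len(known_item_ids)) and map them to real ids,
    # so gaps in the id space can never be selected.
    n = batch_size * count_negative_sample
    return [known_item_ids[rng.randrange(len(known_item_ids))] for _ in range(n)]


# Non-consecutive ids: 2, 4, 5 and 6 are absent.
item_ids = [0, 1, 3, 7]
negatives = sample_negatives(
    item_ids, batch_size=4, count_negative_sample=2, rng=random.Random(0)
)
```

Every sampled value is guaranteed to be a member of `item_ids`, at the cost of keeping the list of known ids around.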

create wrapper for lama and base class for other grad.boost models

This is necessary for convenient use in scenarios.
The wrapper should expose a standard interface (fit/predict).

Make indexer skip renaming

Indexer will fail if columns are user_idx and item_idx, because it can't rename them to the same value.
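
A minimal sketch of the skip logic, assuming the failure comes only from renaming a column to the name it already has. `build_rename_map` is a hypothetical helper, not an existing RePlay function; with Spark, each remaining pair could then be applied with `DataFrame.withColumnRenamed`.

```python
from typing import Dict


def build_rename_map(user_col: str, item_col: str) -> Dict[str, str]:
    """Return {source: target} pairs, skipping columns already named correctly."""
    targets = {user_col: "user_idx", item_col: "item_idx"}
    return {src: dst for src, dst in targets.items() if src != dst}


print(build_rename_map("user_id", "item_id"))
print(build_rename_map("user_idx", "item_idx"))
```

When both columns already carry the target names the map is empty, so no rename is attempted.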

Create indexer save/load functions

With indexers extracted, we now need separate functions to save and load an indexer. Something like:

from os.path import join

from pyspark.ml.feature import IndexToString, StringIndexerModel


def save_index(indexer, path: str):
    # user_type / item_type (the original column data types) are not saved
    # here and would need to be persisted separately.
    indexer.user_indexer.save(join(path, "user_indexer"))
    indexer.item_indexer.save(join(path, "item_indexer"))
    indexer.inv_user_indexer.save(join(path, "inv_user_indexer"))
    indexer.inv_item_indexer.save(join(path, "inv_item_indexer"))


def load_index(path: str):
    indexer = Indexer()  # the RePlay Indexer class; import it from where it lives
    indexer.user_indexer = StringIndexerModel.load(join(path, "user_indexer"))
    indexer.item_indexer = StringIndexerModel.load(join(path, "item_indexer"))
    indexer.inv_user_indexer = IndexToString.load(join(path, "inv_user_indexer"))
    indexer.inv_item_indexer = IndexToString.load(join(path, "inv_item_indexer"))
    return indexer

Test RePlay on Spark 3.2

Check all tests, and the bug with knn optimize.

LightFM indexer bugs

test_predict cannot perform fit on this data; it raises {Exception} Number of user feature rows does not equal the number of users. The problem is probably that user features do not contain all the users in the log. We should investigate what is happening in _feature_table_to_csr. Maybe we should inner join the log and features?
test_predict_pairs and test_enrich_with_features probably fail on fit too.

accelerate l1 reg in SLIM

Need to use L1 regularization as in LAMA (https://github.com/sberbank-ai-lab/AutoMLWhitebox/blob/master/autowoe/lib/selectors/utils.py).

fallback scenario indexers issue

In addition to #68, there is probably a bug in the fallback scenario's indexers.
They seem to be one object, but they need to be different objects to process cold users and items properly.
The length of the scenario's user indexer labels is not equal to the number of users in the train dataset. Also, new users from test were not included in the indexers (at least the PopRec indexers inside Fallback), but should be.
I believe the length of the user indexer for fb_model in the scenario should be 6040, as for a pure PopRec model.

check for shuffle

check base_rec and models code for redundant joins and groupBy

word2vec

estimate maxIter param impact in word2vec model

Optimize with low budget can return suboptimal parameters

Sometimes the default parameters work better than the ones optuna returned, because the defaults were never tried. The defaults should be added as a starting point.
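
The idea of always trying the defaults first can be sketched without Optuna. The example below swaps in plain random search to keep it self-contained; `search_with_defaults` and the toy objective are hypothetical. In Optuna itself, `Study.enqueue_trial` can queue a fixed parameter set to be evaluated before sampling starts.

```python
import random
from typing import Callable, Dict, Tuple


def search_with_defaults(
    objective: Callable[[Dict[str, float]], float],
    defaults: Dict[str, float],
    bounds: Dict[str, Tuple[float, float]],
    budget: int,
    seed: int = 0,
) -> Tuple[Dict[str, float], float]:
    """Random search (maximization) whose first trial is always the defaults."""
    rng = random.Random(seed)
    best_params, best_score = dict(defaults), objective(defaults)
    # The remaining budget is spent on random candidates.
    for _ in range(budget - 1):
        candidate = {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        score = objective(candidate)
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score


def toy_objective(params: Dict[str, float]) -> float:
    # Toy objective that peaks exactly at the default value 0.5.
    return -((params["alpha"] - 0.5) ** 2)


best, score = search_with_defaults(
    toy_objective, {"alpha": 0.5}, {"alpha": (0.0, 1.0)}, budget=3
)
```

With this ordering the returned score can never be worse than the score of the defaults, however small the budget is.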

usersplitter by date returns less items in tests than required

The LAMA team reported that UserSplitter returns fewer items than required by user_test_size with shuffle=False.

Users have sufficient interactions, but timestamps could be equal for all or part of the interactions.
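
The sketch below shows how equal timestamps can make a split return fewer items than requested. It does not reproduce the actual UserSplitter code: both functions are hypothetical and only contrast a threshold comparison with ranking that breaks ties explicitly. In Spark, the second approach corresponds to `row_number` over a window ordered by timestamp plus a tie-breaking column.

```python
from typing import List, Tuple

Interaction = Tuple[str, int]  # (item_id, timestamp)


def last_n_by_threshold(history: List[Interaction], n: int) -> List[Interaction]:
    """Take rows strictly newer than a threshold timestamp (loses tied rows)."""
    # Assumes len(history) > n so that index n exists.
    stamps = sorted((ts for _, ts in history), reverse=True)
    threshold = stamps[n]  # newest timestamp that should stay in train
    return [row for row in history if row[1] > threshold]


def last_n_by_rank(history: List[Interaction], n: int) -> List[Interaction]:
    """Rank rows by (timestamp, item_id) so ties are broken deterministically."""
    ranked = sorted(history, key=lambda row: (row[1], row[0]), reverse=True)
    return ranked[:n]


history = [("a", 10), ("b", 20), ("c", 20), ("d", 20)]
by_threshold = last_n_by_threshold(history, 2)
by_rank = last_n_by_rank(history, 2)
```

With three interactions sharing the newest timestamp, the threshold version returns nothing for a requested test size of 2, while the rank version always returns exactly 2 rows.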

models tests

Add a test for all models that checks that each model:

returns prediction for warm users
returns prediction for cold users (if is able to predict cold) / does not fail and returns nothing for cold users
returns prediction for new users (if is able to predict new) /does not fail and returns nothing for new users

fallback scenario bug

The fallback scenario does not return enough recommendations (see the screenshot).
The fallback function from utils works well.

It is probably a bug inside Fallback._predict, after we get recommendations from the models' _predict. These recommendations still contain seen items and are not cropped to the top-K, because both actions are performed inside _predict_wrap. We apply fallback to the recommendations before seen-item filtering and cropping.
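
A sketch of the intended order of operations, using plain lists instead of Spark DataFrames: filter seen items and crop the main model's output first, then pad from the fallback model. `top_k_unseen` and `fallback_predict` are hypothetical names, not RePlay's API.

```python
from typing import List, Set


def top_k_unseen(ranked: List[str], seen: Set[str], k: int) -> List[str]:
    """Drop seen items first, then crop to the top-k."""
    return [item for item in ranked if item not in seen][:k]


def fallback_predict(
    main_recs: List[str], fallback_recs: List[str], seen: Set[str], k: int
) -> List[str]:
    """Filter and crop the main list, then pad it from the fallback list."""
    recs = top_k_unseen(main_recs, seen, k)
    for item in fallback_recs:
        if len(recs) >= k:
            break
        # Never pad with seen items or with items already recommended.
        if item not in seen and item not in recs:
            recs.append(item)
    return recs


recs = fallback_predict(
    ["i1", "i2", "i3"], ["i2", "i4", "i5", "i6"], seen={"i1", "i3"}, k=3
)
```

Here the main model contributes only `i2` after filtering, and the fallback list fills the remaining two slots.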

SecondLevelFeaturesProcessor refactoring

create base class with fit - transform interface
create separate classes for stat features based on the log and for conditional popularity features processing
add smoothing

`DataPreparator` bug when using user_features and/or item_features

DataPreparator fails to use Indexer.(user|item)_indexer when called with user_features and/or item_features, because Indexer.(user|item)_indexer requires Indexer.fit() to be called before Indexer.transform().

How to reproduce

import pandas as pd

from replay.data_preparator import DataPreparator


df = pd.read_csv(
    "experiments/data/ml1m_ratings.dat", 
    sep="\t", 
    names=["user_id", "item_id", "relevance", "timestamp"]
)
users = pd.read_csv(
    "experiments/data/ml1m_users.dat", 
    sep="\t", 
    names=["user_id", "gender", "age", "occupation", "zip_code"]
)

data_preparator = DataPreparator()
log, user_features, _ = data_preparator(df, users)

Expected behavior

Probably Indexer.fit() should be used earlier in DataPreparator.__call__()

Remove Ignite from base_torch_rec

Our own for-loop would be more convenient.

Rework Fallback Scenario

We now have a Cold model inside the base scenario; it should be extracted and merged with the Fallback scenario.

test_all_models indexer bug

test_predict_pairs_warm_only multvae fails with IndexError: index 4 is out of bounds for dimension 0 with size 4

check two-level scenario

check for bugs/errors
create jupyter notebook illustrating two-level scenario usage and optimisation

update docs structure

The Useful info page should be a top-level page, not a subpage of Modules.

Also, the definitions of cold users and items in the algorithm selection tables should be more detailed. The first table refers to new users with few interactions available, and the second refers to completely cold users without any history at all.

DateSplitter not working correctly with int timestamps

DateSplitter(0.8) raises an error on the MovieLens dataset when it tries to compare int and timestamp values. Probably the split date is a timestamp while the column remains int.
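
One possible fix is to normalize both sides to the same type before comparing. The sketch below assumes the int column holds Unix seconds (as MovieLens timestamps do); `to_datetime` and `split_by_date` are hypothetical helpers, not DateSplitter's implementation.

```python
from datetime import datetime, timezone
from typing import List, Tuple


def to_datetime(value) -> datetime:
    """Convert Unix seconds to an aware UTC datetime; pass datetimes through."""
    if isinstance(value, datetime):
        return value
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


def split_by_date(
    timestamps: List[int], split_date: datetime
) -> Tuple[List[int], List[int]]:
    """Compare like with like: both sides are datetimes after conversion."""
    train = [ts for ts in timestamps if to_datetime(ts) < split_date]
    test = [ts for ts in timestamps if to_datetime(ts) >= split_date]
    return train, test


split_date = datetime(2001, 1, 1, tzinfo=timezone.utc)
train, test = split_by_date([978300760, 978302109, 1046454590], split_date)
```

The first two values fall on 2000-12-31 UTC and land in train; the third is from 2003 and lands in test.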

MultVAE bug in predict for cold users

There is a bug in MultVAE that leads to a weird-looking error during predict for a cold user:

File "/home/volodkevich/replay_tasks/current_base/replay/models/mult_vae.py", line 312, in _predict_pairs_inner
    user_batch[0, items_np_history] = 1
IndexError: index -9223372036854775808 is out of bounds for dimension 0 with size 3646

items_np_history is an empty list for a cold user, and this code fails. But we actually do not want the model to run predict for cold users, so we need to delete the unnecessary code and filter users before applyInPandas.
works for warm:

fails for cold:

Use arbitrary relevances for KNN and PopRecs

The current implementations of KNN, PopRec, UserpopRec and RandomRec ignore relevance values and treat them all as 1.
A new __init__ parameter should be added to support arbitrary relevance weights.
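
For a popularity-style model, such a switch might look like the sketch below. The flag name `use_relevance` and the function `item_popularity` are assumptions for illustration; the issue does not specify the parameter name.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def item_popularity(
    log: Iterable[Tuple[str, str, float]], use_relevance: bool = False
) -> Dict[str, float]:
    """Sum per-item weights; each interaction counts as 1 unless use_relevance is set."""
    scores: Dict[str, float] = defaultdict(float)
    for _user, item, relevance in log:
        scores[item] += relevance if use_relevance else 1.0
    return dict(scores)


log = [
    ("u1", "i1", 5.0),
    ("u2", "i1", 1.0),
    ("u1", "i2", 4.0),
    ("u3", "i2", 4.0),
    ("u2", "i2", 1.0),
]
counts = item_popularity(log)
weighted = item_popularity(log, use_relevance=True)
```

Keeping the default at `False` preserves the current behaviour, where every interaction counts as 1.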

Study is not saved with model_handler

Optuna study should be added as a saved and loaded parameter in model_handler.

Update notebooks for indexer usage

Add the ability to work on a cluster (on sbercloud)

Python 3.6

Python==3.9.7
poetry==1.1.11
pip==21.3.1

pyproject.toml like this

[tool.poetry]
name = "replay-rec"
version = "0.7.0"
description = "RecSys Library"
authors = [""]

[tool.poetry.dependencies]
python = ">=3.6.2, <3.10"
lightautoml = ">=0.3.1"
numpy = [
  {version = ">=1.20.0", python = ">=3.7"},
  {version = "*", python = "<3.7"}
]

Cannot complete poetry lock with the following message

The current project's Python requirement (>=3.6.2,<3.10) is not compatible with some of the required packages Python requirement:
    - numpy requires Python >=3.7,<3.11, so it will not be satisfied for Python >=3.6.2,<3.7
    - numpy requires Python >=3.7, so it will not be satisfied for Python >=3.6.2,<3.7
    - numpy requires Python >=3.7, so it will not be satisfied for Python >=3.6.2,<3.7
    - numpy requires Python >=3.7, so it will not be satisfied for Python >=3.6.2,<3.7
    - numpy requires Python >=3.7, so it will not be satisfied for Python >=3.6.2,<3.7
    - numpy requires Python >=3.7, so it will not be satisfied for Python >=3.6.2,<3.7
    - numpy requires Python >=3.7, so it will not be satisfied for Python >=3.6.2,<3.7
    - numpy requires Python >=3.7,<3.11, so it will not be satisfied for Python >=3.6.2,<3.7
    - numpy requires Python >=3.7,<3.11, so it will not be satisfied for Python >=3.6.2,<3.7

  Because no versions of numpy match >1.20.0,<1.20.1 || >1.20.1,<1.20.2 || >1.20.2,<1.20.3 || >1.20.3,<1.21.0 || >1.21.0,<1.21.1 || >1.21.1,<1.21.2 || >1.21.2,<1.21.3 || >1.21.3,<1.21.4 || >1.21.4
   and numpy (1.20.0) requires Python >=3.7, numpy is forbidden.
  And because numpy (1.20.1) requires Python >=3.7
   and numpy (1.20.2) requires Python >=3.7, numpy is forbidden.
  And because numpy (1.20.3) requires Python >=3.7
   and numpy (1.21.0) requires Python >=3.7, numpy is forbidden.
  And because numpy (1.21.1) requires Python >=3.7
   and numpy (1.21.2) requires Python >=3.7,<3.11, numpy is forbidden.
  And because numpy (1.21.3) requires Python >=3.7,<3.11
   and numpy (1.21.4) requires Python >=3.7,<3.11, numpy is forbidden.
  Because no versions of lightautoml match >0.3.1
   and lightautoml (0.3.1) depends on numpy (>=1.20.0), lightautoml (>=0.3.1) requires numpy (>=1.20.0).
  Thus, lightautoml is forbidden.
  So, because replay-rec depends on lightautoml (>=0.3.1), version solving failed.

  at ~/python/.e397/lib/python3.9/site-packages/poetry/puzzle/solver.py:241 in _solve
      237│             packages = result.packages
      238│         except OverrideNeeded as e:
      239│             return self.solve_in_compatibility_mode(e.overrides, use_latest=use_latest)
      240│         except SolveFailure as e:
    → 241│             raise SolverProblemError(e)
      242│
      243│         results = dict(
      244│             depth_first_search(
      245│                 PackageNode(self._package, packages), aggregate_package_nodes

  • Check your dependencies Python requirement: The Python requirement can be specified via the `python` or `markers` properties

    For numpy, a possible solution would be to set the `python` property to ">=3.7,<3.10"
    For numpy, a possible solution would be to set the `python` property to ">=3.7,<3.10"
    For numpy, a possible solution would be to set the `python` property to ">=3.7,<3.10"
    For numpy, a possible solution would be to set the `python` property to ">=3.7,<3.10"
    For numpy, a possible solution would be to set the `python` property to ">=3.7,<3.10"
    For numpy, a possible solution would be to set the `python` property to ">=3.7,<3.10"
    For numpy, a possible solution would be to set the `python` property to ">=3.7,<3.10"
    For numpy, a possible solution would be to set the `python` property to ">=3.7,<3.10"
    For numpy, a possible solution would be to set the `python` property to ">=3.7,<3.10"

    https://python-poetry.org/docs/dependency-specification/#python-restricted-dependencies,
    https://python-poetry.org/docs/dependency-specification/#using-environment-markers

"optimize" refactoring

We use both _search_space and param_grid in the optuna objective, which makes the code unclear.
We fix the value of some parameters in _search_space (e.g. latent_dim in MultVAE), and the optimize behaviour is unclear when the fixed value differs from the value the user defined at model initialisation.

We need to refactor the code and check that we do not change user-defined parameters that are not included in param_grid, or are fixed in _search_space, when param_grid is empty.

Exponential time smoothing

We can add some time-awareness to models by applying time-dependent weights to relevance values. This weighting can happen at three places in the recommendation process:

Before model training
At prediction time before get_top_k method
After prediction, as a way to rerank final recommendations

Regardless of the option we choose (or support all of them), we should have functions that calculate these weights.

Arguments should include

decay: the "half-life" of a weight, i.e. the number of days after which the weight is reduced by 50%. Probably a float.
limit: the minimal value the weight can reach, to avoid zeroing very old interactions.

There are two options to calculate weights: for each interaction and for each item.

Both take log with timestamp values as an input, but return values are different.

Item-weights return a new DataFrame mapping item_id to weight.
Interaction-weights modify log relevance values in place.
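
The weight functions described above might be sketched as follows. The formula is the usual half-life decay, weight = max(limit, 0.5 ** (age_days / decay)). Timestamps are expressed in days to keep the example short, the interaction variant returns a new list instead of modifying the log in place, and defining an item's weight by its most recent interaction is an assumption, since the issue does not say how item weights are derived. All function names are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, float, float]  # (user_id, item_id, relevance, day)


def decay_weight(age_days: float, decay: float, limit: float = 0.0) -> float:
    """Halve the weight every `decay` days, never dropping below `limit`."""
    return max(limit, 0.5 ** (age_days / decay))


def interaction_weights(
    log: List[Row], now_days: float, decay: float, limit: float = 0.0
) -> List[Row]:
    """Return a copy of the log with relevance scaled by its age-based weight."""
    return [
        (user, item, rel * decay_weight(now_days - day, decay, limit), day)
        for user, item, rel, day in log
    ]


def item_weights(
    log: List[Row], now_days: float, decay: float, limit: float = 0.0
) -> Dict[str, float]:
    """Map each item to the weight of its most recent interaction (assumption)."""
    latest: Dict[str, float] = {}
    for _user, item, _rel, day in log:
        latest[item] = max(day, latest.get(item, day))
    return {
        item: decay_weight(now_days - day, decay, limit)
        for item, day in latest.items()
    }


log = [("u1", "i1", 1.0, 0.0), ("u1", "i2", 1.0, 30.0)]
per_item = item_weights(log, now_days=30.0, decay=30.0)
per_row = interaction_weights(log, now_days=30.0, decay=30.0)
```

With a 30-day half-life, an interaction that is 30 days old keeps half of its relevance, and `limit` puts a floor under interactions that are much older.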

Extract indexer into data preparator

`poetry install` fails with Python 3.9.7

Works well for Python 3.7.11, but for 3.9.7 hangs on pandas

Neuromf indexer bug

test_predict fails during predict, at the Spark pandas apply step in _predict_by_user. Predicting by hand for each user works.

unify filter_seen in _predict

RePlay/replay/models/base_rec.py

Line 457 in 85f0a17

recs = self._predict(

We pass filter_seen into _predict and then filter again. We should either remove this parameter from _predict or stop filtering after _predict. If possible, we should filter inside _predict.

RePlay installation failed on Python 3.8.12

I tried to install the RePlay library on Python 3.8.12 with the pip install replay-rec command in a new virtual env. Pip had been updated to version 21.3.1 beforehand.

The main problem was with implicit, which tried to find NVCC and CUDA on a CPU-only machine.

Any ideas?

check notebooks

notebooks work
notebooks are up-to-date
leave only one of the compare_model notebooks
check ml_ratings.csv usage in notebooks and drop it if it is not used

Remove redundant models

stack
classifier

rename to make class name meaningful
have a look at the pyspark mllib transformer/estimator interface to decide if inherit from it
evaluate code logic and change if necessary

E   AssertionError: DataFrame.iloc[:, 0] (column name="str_") are different
E   
E   DataFrame.iloc[:, 0] (column name="str_") values are different (100.0 %)
E   [left]:  [2021-08-22T00:00:00.000000000, 2021-08-23T11:29:29.000000000, 2021-08-27T00:00:00.000000000]
E   [right]: [2021-08-21T22:00:00.000000000, 2021-08-23T09:29:29.000000000, 2021-08-26T22:00:00.000000000]

add doc page related to data preprocessing and feature generation

describe classes from data_preparator and history_based_fp

create CandidatesGenerator

Create a class that wraps RePlay models for candidate generation, which will:

preprocess log and features
train models
generate candidates from each model/selected models
get fallback prediction and ensemble candidates dataset
add relevance/rank from each model
add features from selected models
add "generated_by" column for each model?

changes in metrics

leave top-k recommendations before aggregating to list
remove joining and grouping of results by k during metric calculation for different K
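
The two changes above can be illustrated with a single-pass computation over lists cropped once to the largest K. HitRate is used only as an example metric, and `hit_rate_at_ks` is a hypothetical function operating on plain dicts rather than Spark DataFrames.

```python
from typing import Dict, List, Sequence, Set


def hit_rate_at_ks(
    recs: Dict[str, List[str]], truth: Dict[str, Set[str]], ks: Sequence[int]
) -> Dict[int, float]:
    """HitRate@k for every k from one pass over lists cropped to max(ks)."""
    max_k = max(ks)
    hits = {k: 0 for k in ks}
    for user, relevant in truth.items():
        top = recs.get(user, [])[:max_k]  # crop once, before any per-k work
        # 1-based rank of the first relevant item, or None if there is none.
        first_hit = next(
            (rank for rank, item in enumerate(top, 1) if item in relevant), None
        )
        for k in ks:
            if first_hit is not None and first_hit <= k:
                hits[k] += 1
    return {k: hits[k] / len(truth) for k in ks}


recs = {"u1": ["a", "b", "c"], "u2": ["x", "y", "z"]}
truth = {"u1": {"b"}, "u2": {"q"}}
result = hit_rate_at_ks(recs, truth, ks=[1, 2])
```

Each user's list is visited once, and the values for all K are derived from the same rank information instead of being joined and regrouped per K.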