
sarplus (preview)

pronounced sUrplus as it's simply better if not best!


Features

  • Scalable PySpark based implementation
  • Fast C++ based predictions
  • Reduced memory consumption: the similarity matrix is cached in memory once per worker and shared across Python executors
  • Easy setup using Spark Packages

Benchmarks

| # Users | # Items | # Ratings | Runtime | Environment | Dataset |
|---------|---------|-----------|---------|-------------|---------|
| 2.5 million | 35k | 100 million | 1.3 h | Databricks, 8 workers, Azure Standard DS3 v2 (4-core machines) | |

Top-K Recommendation Optimization

There are a couple of key optimizations:

  • map item ids (e.g. strings) to a contiguous set of indexes to optimize storage and simplify access
  • convert the similarity matrix to exactly the representation the C++ component needs, enabling simple shared memory mapping of the cache file and avoiding parsing; this requires a custom formatter, written in Scala
  • shared read-only memory mapping allows us to re-use the same memory from multiple python executors on the same worker node
  • partition the input test users and past seen items by users, allowing for scale out
  • perform as much of the work as possible in PySpark (way simpler)
  • top-k computation
      • reverse the join: join each user's past seen items with any related items and sum the scores
      • make sure to always keep just the top-k items in memory
      • use a standard join with binary search between the users' past seen items and the related items
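The first optimization above (re-indexing item ids) can be sketched in plain pandas; `pd.factorize` here is only an illustrative stand-in for whatever the library's actual Scala/PySpark implementation does:

```python
import pandas as pd

# Hypothetical ratings with string item ids (not the library's real data).
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "item_id": ["apple", "pear", "apple", "kiwi"],
})

# Map arbitrary item ids to a contiguous 0..n-1 index range so the
# similarity matrix can be stored as a dense, offset-addressable array.
codes, uniques = pd.factorize(ratings["item_id"])
ratings["item_idx"] = codes

# codes is now [0, 1, 0, 2]; `uniques` maps indexes back to the
# original ids when producing output.
```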

Image of sarplus top-k recommendation optimization
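The "keep just the top-k items in memory" step can be sketched with a bounded min-heap; the scores and `heapq` usage here are illustrative, not the C++ component's actual code:

```python
import heapq

def top_k(scored_items, k):
    """Keep at most k (score, item) pairs using a size-bounded min-heap."""
    heap = []
    for item, score in scored_items:
        if len(heap) < k:
            heapq.heappush(heap, (score, item))
        elif score > heap[0][0]:
            # New score beats the current k-th best: evict the minimum.
            heapq.heapreplace(heap, (score, item))
    return sorted(heap, reverse=True)

# Hypothetical per-user item scores.
scores = [("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.7)]
best = top_k(scores, k=2)  # [(0.9, "b"), (0.7, "d")]
```

This keeps memory at O(k) per user regardless of how many related items the join produces.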

Usage

import pandas as pd
from pysarplus import SARPlus

# spark dataframe with user/item/rating/optional timestamp tuples
train_df = spark.createDataFrame(
    pd.DataFrame({
        'user_id': [1, 1, 2, 3, 3],
        'item_id': [1, 2, 1, 1, 3],
        'rating':  [1, 1, 1, 1, 1],
    }))

# spark dataframe with user/item tuples
test_df = spark.createDataFrame(
    pd.DataFrame({
        'user_id': [1, 3],
        'item_id': [1, 3],
        'rating':  [1, 1],
    }))
    
model = SARPlus(spark, col_user='user_id', col_item='item_id', col_rating='rating', col_timestamp='timestamp')
model.fit(train_df, similarity_type='jaccard')


model.recommend_k_items(test_df, 'sarplus_cache', top_k=3).show()

# For databricks
# model.recommend_k_items(test_df, 'dbfs:/mnt/sarpluscache', top_k=3).show()

Jupyter Notebook

Insert this cell prior to the code above.

import os

SUBMIT_ARGS = "--packages eisber:sarplus:0.2.5 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("sample")
    .master("local[*]")
    .config("spark.driver.memory", "1G")
    .config("spark.sql.shuffle.partitions", "1")
    .config("spark.sql.crossJoin.enabled", True)
    .config("spark.ui.enabled", False)
    .getOrCreate()
)

PySpark Shell

pip install pysarplus
pyspark --packages eisber:sarplus:0.2.5 --conf spark.sql.crossJoin.enabled=true

Databricks

One must set the crossJoin property to enable calculation of the similarity matrix (Clusters / <Cluster> / Configuration / Spark Config):

spark.sql.crossJoin.enabled true
  1. Navigate to your workspace
  2. Create library
  3. Under 'Source' select 'Maven Coordinate'
  4. Enter 'eisber:sarplus:0.2.5' or 'eisber:sarplus:0.2.6' if you're on Spark 2.4.1
  5. Hit 'Create Library'
  6. Attach to your cluster
  7. Create 2nd library
  8. Under 'Source' select 'Upload Python Egg or PyPI'
  9. Enter 'pysarplus'
  10. Hit 'Create Library'

This will install C++, Python and Scala code on your cluster.

You'll also have to mount shared storage

  1. Create Azure Storage Blob
  2. Create storage account (e.g. )
  3. Create container (e.g. sarpluscache)
  4. Navigate to User / User Settings
  5. Generate new token: enter 'sarplus'
  6. Use the Databricks CLI (installation here)
  7. Run databricks configure --token (Host: e.g. https://westus.azuredatabricks.net)
  8. Run databricks secrets create-scope --scope all --initial-manage-principal users
  9. Run databricks secrets put --scope all --key sarpluscache and enter the Azure Storage Blob key of the storage account created before
  10. Run the mount code

dbutils.fs.mount(
  source = "wasbs://sarpluscache@<accountname>.blob.core.windows.net",
  mount_point = "/mnt/sarpluscache",
  extra_configs = {"fs.azure.account.key.<accountname>.blob.core.windows.net":dbutils.secrets.get(scope = "all", key = "sarpluscache")})

Disable annoying logging

import logging
logging.getLogger("py4j").setLevel(logging.ERROR)

Packaging

For Databricks to properly install a C++ extension, one must take a detour through PyPI. Use twine to upload the package to PyPI.

cd python

python setup.py sdist

twine upload dist/pysarplus-*.tar.gz

On Spark one can install all 3 components (C++, Python, Scala) in one pass by creating a Spark Package. Documentation is rather sparse. Steps to install:

  1. Package and publish the pip package (see above)
  2. Package the Spark package, which includes the Scala formatter and references the pip package (see below)
  3. Upload the zipped Scala package to Spark Packages through a browser. sbt spPublish has a few issues, so it always fails for me. Don't use spPublishLocal, as the packages are not created properly (names don't match up, issue) and furthermore fail to install if published to Spark-Packages.org.
cd scala
sbt spPublish

Testing

To test the python UDF + C++ backend

cd python 
python setup.py install && pytest -s tests/

To test the Scala formatter

cd scala
sbt test

(use ~test and it will automatically check for changes in source files, but not build.sbt)
