Recommendation system using Alternating Least Squares (ALS) and Cosine Similarity on PySpark and Elasticsearch
There are basically three types of recommendation systems:
- Content-Based Filtering
- Collaborative Filtering
- Hybrid
Content-based filtering takes the attributes or characteristics of the items into account to make the recommendation. For example, if we’re looking to recommend songs, we’ll look at the genre, duration, singer, and various other attributes that make up the item.
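The song example can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the catalogue, the attribute tags, and the helper names `jaccard` and `similar_items` are hypothetical, and Jaccard overlap is just one possible way to compare attribute sets.

```python
# Hypothetical catalogue: each song is described only by its own attributes.
songs = {
    "song_a": {"rock", "1990s", "guitar"},
    "song_b": {"rock", "1990s", "piano"},
    "song_c": {"jazz", "1960s", "piano"},
}


def jaccard(a, b):
    """Share of attributes two items have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_items(liked, catalogue, top_n=2):
    """Rank the other items by attribute overlap with the liked item."""
    scores = {
        name: jaccard(catalogue[liked], attrs)
        for name, attrs in catalogue.items()
        if name != liked
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


recommended = similar_items("song_a", songs)
print(recommended)
```

No information about other users is needed here: the ranking depends only on the item attributes.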
Pros:
- Requires less data.
- It is not necessary to identify users with similar preferences.
- It largely avoids the cold start problem for new items. Cold start is a known issue in recommender systems: the algorithm’s inability to make recommendations for items or users about which it does not have enough information.
Cons:
- It suffers from a lack of diversity: it can only recommend items that closely resemble ones the user already liked.
- It depends on item data being filled in correctly and on the system being fed correctly.
- Items with identical characteristics are treated as equal.
Collaborative filtering analyzes the preferences of other users to make recommendations. It is divided into two types: memory-based and model-based.
Memory-based approaches build similarity matrices between all users or all items. Once this similarity is identified, new items can be recommended.
There are several ways of computing similarity between vectors, such as Euclidean distance, Minkowski distance, Jaccard similarity, and cosine similarity, which measures the cosine of the angle between two vectors.
Two vectors are most similar when the angle between them is 0°, where the cosine has a value of 1.
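As a minimal sketch of the cosine measure, the pure-Python function below divides the dot product of two vectors by the product of their lengths. The user names and rating vectors are made up for illustration, and returning 0.0 for a zero vector is a convention chosen here, not a rule.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical rating vectors of three users over the same four movies.
alice = [5, 3, 0, 1]
bob = [10, 6, 0, 2]    # same direction as alice, twice the magnitude
carol = [0, 0, 4, 0]   # rated only a movie that alice did not rate

sim_ab = cosine_similarity(alice, bob)    # angle of 0 degrees, so close to 1
sim_ac = cosine_similarity(alice, carol)  # orthogonal vectors, so 0
```

Because only the angle matters, `bob` counts as fully similar to `alice` even though his ratings are twice as large.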
The second type, model-based filtering, learns a model from the interaction data. Consider a user-movie (or movie-user) interaction matrix, where each entry records an interaction between user i and movie j. In a real-world setting this matrix is extremely sparse, because the vast majority of movies receive very few ratings or none at all.
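To make the sparsity concrete, the sketch below stores only the observed interactions and counts how many cells of the full matrix they cover. The user IDs, movie IDs, and ratings are hypothetical.

```python
# Hypothetical ratings: only observed (user, movie) pairs are stored.
ratings = {
    ("u1", "m1"): 5.0,
    ("u1", "m3"): 3.0,
    ("u2", "m1"): 4.0,
    ("u3", "m4"): 1.0,
}

users = sorted({user for user, _ in ratings})
# m2 and m5 exist in the catalogue but were never rated by anyone.
movies = sorted({movie for _, movie in ratings} | {"m2", "m5"})

total_cells = len(users) * len(movies)
sparsity = 1 - len(ratings) / total_cells
print(f"{len(ratings)} of {total_cells} cells observed, sparsity = {sparsity:.2%}")
```

Even in this toy case most cells are empty; real catalogues with many thousands of movies are typically far sparser.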
With such a sparse matrix, which ML algorithms can be trained reliably enough to make inferences? One answer is matrix factorization.
Matrix factorization decomposes a matrix into a product of matrices. Here the user-item interaction matrix is approximated by the product of two smaller matrices.
One matrix can be seen as the user matrix, where rows represent users and columns are latent factors (attributes or characteristics). The other is the item matrix, where rows are latent factors and columns represent items.
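With the two matrices laid out this way, a predicted rating is the dot product of a user's row and an item's column. The sketch below uses hand-picked, hypothetical factor values with two latent factors; in practice these values are learned from the data.

```python
# Hypothetical factors with k = 2 latent factors.
user_factors = [      # rows: users, columns: latent factors
    [1.0, 0.5],
    [0.0, 2.0],
]
item_factors = [      # rows: latent factors, columns: items
    [4.0, 1.0, 0.0],
    [2.0, 0.0, 1.0],
]


def predict(user, item):
    """Predicted rating = dot product of the user's row and the item's column."""
    return sum(
        user_factors[user][f] * item_factors[f][item]
        for f in range(len(item_factors))
    )


# Full predicted matrix: 2 users x 3 items, including pairs never observed.
predicted = [[predict(u, i) for i in range(3)] for u in range(2)]
```

Every user-item pair gets a prediction, whether or not the original matrix had an entry for it.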
This allows the model to predict personalized movie ratings more accurately; for example, less-known movies can have latent representations as rich as those of popular movies.
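Alternating least squares (ALS) is one way to fit such a factorization: hold the item factors fixed and solve a least-squares problem for each user, then hold the user factors fixed and solve for each item, and repeat. The sketch below is a deliberately small pure-Python illustration with a single latent factor, so each solve is a closed-form scalar update. The ratings, the regularization value `reg`, and the helper names are assumptions, and this is not the Spark ML implementation.

```python
# Hypothetical observed ratings; missing (user, item) pairs are simply absent.
ratings = {
    (0, 0): 5.0, (0, 1): 3.0,
    (1, 0): 4.0, (1, 2): 1.0,
    (2, 1): 2.0, (2, 2): 1.0,
}
n_users, n_items = 3, 3
reg = 0.1  # L2 regularization strength (assumed value)


def loss(user_f, item_f):
    """Regularized squared error over the observed ratings only."""
    err = sum((r - user_f[u] * item_f[i]) ** 2 for (u, i), r in ratings.items())
    penalty = reg * (sum(x * x for x in user_f) + sum(x * x for x in item_f))
    return err + penalty


def solve(fixed, own_index, n_own):
    """Closed-form least-squares update for one side while the other is fixed."""
    updated = []
    for own in range(n_own):
        num = den = 0.0
        for key, r in ratings.items():
            if key[own_index] == own:
                other = fixed[key[1 - own_index]]
                num += r * other
                den += other * other
        updated.append(num / (den + reg))
    return updated


user_f = [1.0] * n_users
item_f = [1.0] * n_items
history = [loss(user_f, item_f)]
for _ in range(10):
    user_f = solve(item_f, own_index=0, n_own=n_users)  # items fixed, update users
    item_f = solve(user_f, own_index=1, n_own=n_items)  # users fixed, update items
    history.append(loss(user_f, item_f))
```

Each half-step minimizes the loss exactly for one side, so `history` never increases. With more than one latent factor, each update becomes a small linear system per user or item instead of a scalar division.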
TODO: Alternating Least Squares (ALS) with Spark ML
Set up a 3-node Spark cluster and a single-node Elasticsearch instance with:
docker-compose up -d --build
Then run the Jupyter notebook.