This document describes the Movie Recommender API and its features.
It was developed on a machine with sbt 1.8.2, Scala 2.13.7, and Java 1.8.0_351.
The service comprises two parts:
- A movie recommendation service which generates a recommendations db - built in Spark
- The API service which exposes a RESTful API to serve these recommendations - built in Cats IO
The API depends on the db.json file generated by the Spark job in order to serve results.
To save time, the recommendations are read from a raw JSON file. This is obviously not suitable for production, and in order to scale this solution more components (Redis, for example) would need to be brought into the fold.
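For illustration only, the db.json lookup could take a shape mirroring the API responses, one record per movie (the exact schema here is an assumption, not the file's actual layout):

```json
[
  {"id": 2, "recommended": [15, 113, 58], "relevance": [1, 2, 3]},
  {"id": 96, "recommended": [70, 6, 24],  "relevance": [1, 2, 3]}
]
```

A flat array like this is cheap to load into memory at API startup, which is part of why it doesn't scale: every instance must hold and reload the whole dataset.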
How frequently the movie database is updated would also determine how often we need to run the Spark job, as it only needs to be run periodically. Real-time recommendations for new additions would be a different problem, requiring a streaming hashed-lookup solution or similar.
From the movie-recommendation-system root, run:
make test
This will run all the service tests via sbt test
First we need to generate the recommendations from the metadatas.json file.
N.B.: Spark can be quite picky about Java/Scala versions, so we run it in Docker.
To run the recommendations job, from the movie-recommendation-system root run:
make build_recs_docker
which will build and run the job in Docker.
If you wish to run it on your local machine instead:
make build_recs
This will output a db.json of recommendations via sbt build_recs.
Notes: This is written as a Spark ML pipeline using feature hashing and a similarity matrix
In a real production environment we would not be using a text file for the lookup, but most likely a key-value store like Redis. Places where this has impacted design choices are outlined in the code.
Spark was chosen because we can scale this appropriately - from experience, even using a relatively naive algorithm such as the n^2 Euclidean comparison we've chosen, a 20-node cluster can deliver recommendations for 6 million items in about 2.5 hours.
The method we've used could be made faster and more computationally feasible for huge datasets by using locality-sensitive hashing or another probabilistic data structure / join algorithm.
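The actual job is a Scala/Spark ML pipeline, but the naive approach described above (feature-hash each movie's attributes into a fixed-width vector, then do a pairwise n^2 Euclidean comparison and keep the nearest neighbours) can be sketched in a few lines. This is a toy illustration only; the function names and data below are made up, not the pipeline's real code:

```python
import hashlib
import math

def hash_features(tokens, dim=16):
    """Feature hashing: fold a variable-size bag of tokens into a fixed-size vector."""
    vec = [0.0] * dim
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        # Signed hashing: the sign bit reduces the bias introduced by bucket collisions.
        sign = 1.0 if (h // dim) % 2 == 0 else -1.0
        vec[h % dim] += sign
    return vec

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k_recommendations(movies, k=3):
    """Naive n^2 comparison: for each movie, rank every other movie by distance."""
    vecs = {mid: hash_features(toks) for mid, toks in movies.items()}
    return {
        mid: sorted((m for m in vecs if m != mid),
                    key=lambda m: euclidean(v, vecs[m]))[:k]
        for mid, v in vecs.items()
    }

# Illustrative data: movie id -> bag of feature tokens (genres, keywords, ...)
movies = {
    1: ["crime", "drama", "prison"],
    2: ["crime", "drama", "mafia"],
    3: ["comedy", "romance"],
    4: ["crime", "mafia", "heist"],
}
print(top_k_recommendations(movies))
```

Every pair of movies is compared, which is what makes the job n^2 and why locality-sensitive hashing (which only compares items landing in the same hash bucket) would cut the work for huge datasets.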
To run the recommendations API, from the movie-recommendation-system root run:
make run_server
Starts up the web server on http://0.0.0.0:8080 via sbt run_server.
The API is serviced by the following RESTful endpoints:
/<int:movie_id>
- returns 3 recommendations and their relevance for a given movie_id
Example GET requests:

- curl -X GET http://0.0.0.0:8080/2
  Returns {"id":2,"recommended":[15,113,58],"relevance":[1,2,3]}
  So if you like The Godfather, you may like Casino, The Godfather II and Goodfellas (ids 15, 113 and 58). Pretty good!
- curl -X GET http://0.0.0.0:8080/20
  Returns {"id":20,"recommended":[31,45,128],"relevance":[1,2,3]}
  So if you like The Exorcist, you may like The Shining, The Thing and The Help.
- curl -X GET http://0.0.0.0:8080/1
  Returns {"id":1,"recommended":[58,105,124],"relevance":[1,2,3]}
  So if you like The Shawshank Redemption, you may like Goodfellas, The Help and Dogville.
- curl -X GET http://0.0.0.0:8080/96
  Returns {"id":96,"recommended":[70,6,24],"relevance":[1,2,3]}
  So if you like The Wizard of Oz, you may like The Princess Bride, The General and The Message.
There are some TODOs still in the code which would be addressed given more time - exception handling could be cleaner in some cases.
The Spark code needs testing - building the test infrastructure around Spark is quite tedious, so it has been left. Snapshot tests would probably fit best, run over a small data input with a low level of parallelism so as not to slow down CI.
More requirements would need to be gathered in order to progress on the others. Educated guesses/assumptions have been made where possible.