
Amazon Product Recommendation from Reviews Dataset

Introduction:

In the current digital landscape of burgeoning data and the convenience of shopping from home, the product catalogs on e-commerce websites are exploding. While this presents a good business opportunity for these companies, it can quickly turn into a problem if customers find themselves lost in the online marketplace, unable to find the product they are looking to buy.

To mitigate this problem, we propose to design a recommendation system personalized to the taste of each customer, old or new. Such a recommendation system caters to the needs of each user and provides them with the promised shopping experience, leading to improved customer retention and satisfaction.

To build this recommendation engine, we have carefully chosen each component, from the datastore to how the models are trained and deployed, to make the system scalable and robust.

DOMAIN: E-Commerce

DATASET USED:

https://nijianmo.github.io/amazon/index.html#complete-data

Data Description:

Structure of a review:

reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
asin - ID of the product, e.g. 0000013714
reviewerName - name of the reviewer
helpful - helpfulness rating of the review, e.g. 2/3
reviewText - text of the review
overall - rating of the product
summary - summary of the review
unixReviewTime - time of the review (unix time)
reviewTime - time of the review (raw)

Structure of metadata:

asin - ID of the product, e.g. 0000031852
title - name of the product
price - price in US dollars (at time of crawl)
imUrl - url of the product image
related - related products (also bought, also viewed, bought together, buy after viewing)
salesRank - sales rank information
brand - brand name
categories - list of categories the product belongs to

Methodology:

Storage - The idea is to download the dataset, store it in MongoDB, and use Elasticsearch as the storage layer for scalability.

Data Preprocessing - We use Spark to preprocess the data, which entails everything from cleaning it to deriving the useful features on which the recommendation model is trained.

Training the model - Using Spark ML, we train the embeddings from the data stored in Elasticsearch (see the sketch after this list of steps).

Recommendation- Using Elasticsearch queries, we will generate some example recommendations.

FrontEnd - We display the results in a frontend application that shows product recommendations for a particular user.
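As a rough sketch of the training step, here is what learning product embeddings with Spark ML's ALS might look like, assuming the cleaned reviews are available as JSON lines with the reviewerID, asin, and overall fields described above. The file name, rank, and other settings are illustrative, not the project's actual choices.

```python
# Illustrative sketch only: learn product embeddings from review ratings.
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("product-recommendations").getOrCreate()
reviews = spark.read.json("VinylAndCD.txt")  # cleaned reviews, one JSON object per line

# ALS needs numeric IDs, so index the string reviewerID/asin columns first.
reviews = StringIndexer(inputCol="reviewerID", outputCol="userIdx").fit(reviews).transform(reviews)
reviews = StringIndexer(inputCol="asin", outputCol="itemIdx").fit(reviews).transform(reviews)

als = ALS(userCol="userIdx", itemCol="itemIdx", ratingCol="overall",
          rank=32, coldStartStrategy="drop")
model = als.fit(reviews)

# model.itemFactors holds one embedding vector per product; these are the
# vectors that would be indexed into Elasticsearch for similarity queries.
model.itemFactors.show(5, truncate=False)
```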

Implementation Details (Milestone Submission)

The Collections.py file, under the Database folder, contains a Python script that loads the collections "DigitalMusic", "MoviesAndTv", and "VinylAndCD" from the Amazon Reviews dataset into a MongoDB server.

Steps to run the code:

Connection to MongoDB: Create a MongoDB account and follow the installation instructions (make sure to whitelist your IP). Install the mongo shell and pymongo on your system (you may want to install them directly or use the prebuilt binaries so that all components install successfully).

(Screenshot: the MongoDB cluster web UI on the online free tier, as it appears once you click Connect.)

Follow the process to copy the connection string and create a client node, as shown in the code below.

```python
import os
import pymongo

# Credentials come from the .env file described below.
usr = os.getenv("USR")
password = os.getenv("PASS")

client = pymongo.MongoClient(
    f"mongodb+srv://{usr}:{password}@project.fq5sh6t.mongodb.net/?retryWrites=true&w=majority"
)
```

Once successfully connected, you should be able to open the project you created and read the database.

db = client["reviews"] fetches the reviews database through the client.

Note: You will need to create a .env file in your project holding the USR and PASS variables you set when signing up for the MongoDB cluster, plus a LOCAL variable with the path to the downloaded datasets.
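A minimal example of what that .env file might contain (all values are placeholders):

```
USR=<your MongoDB username>
PASS=<your MongoDB password>
LOCAL=</path/to/the/downloaded/datasets>
```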

Creating a database:

Once the connection is successful, run the script from the project terminal: python collection.py

If the dataset paths in the .env file are set correctly, you will see the collections being pushed to the MongoDB cluster, from which they can later be fetched for data processing in Spark.
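For reference, the core of such a loader might look like the following hypothetical sketch; the dataset file names and batch size are assumptions, not the script's actual values.

```python
import json
import os

import pymongo

usr, password = os.getenv("USR"), os.getenv("PASS")
client = pymongo.MongoClient(
    f"mongodb+srv://{usr}:{password}@project.fq5sh6t.mongodb.net/?retryWrites=true&w=majority"
)
db = client["reviews"]

data_dir = os.getenv("LOCAL")  # directory containing the downloaded files

# Hypothetical mapping of collection names to dataset files.
datasets = {
    "DigitalMusic": "Digital_Music.json",
    "MoviesAndTv": "Movies_and_TV.json",
    "VinylAndCD": "CDs_and_Vinyl.json",
}

for name, filename in datasets.items():
    batch = []
    with open(os.path.join(data_dir, filename)) as f:
        for line in f:
            batch.append(json.loads(line))
            if len(batch) == 10000:  # insert in chunks to bound memory use
                db[name].insert_many(batch)
                batch = []
    if batch:
        db[name].insert_many(batch)
```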

Test:

Run the showcollection.py file to check whether the connection to the database is successful.

This is the output you might expect to see if things run smoothly:

(Screenshot: dataframe example)
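A minimal check along the same lines might look like this (the connection string is the one from above):

```python
import os
import pymongo

client = pymongo.MongoClient(
    f'mongodb+srv://{os.getenv("USR")}:{os.getenv("PASS")}@project.fq5sh6t.mongodb.net/?retryWrites=true&w=majority'
)
db = client["reviews"]

# List each collection and peek at one document to confirm the data landed.
for name in db.list_collection_names():
    print(name)
    print(db[name].find_one())
```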

Cleaning the data for Spark:

Run script.py to generate two files, metaVinyAndCD.txt and VinylAndCD.txt, as inputs to the Spark query processing file, and make sure the correct paths are set.

Execute the command: python script.py
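As a hedged sketch of the kind of cleaning such a script performs, the snippet below keeps only the fields the Spark job needs and writes them back out as JSON lines; the input file name and field list are assumptions, and script.py's actual filtering may differ.

```python
import json

# Fields to keep, following the review structure described earlier.
KEEP = ("reviewerID", "asin", "overall", "unixReviewTime")

with open("CDs_and_Vinyl.json") as src, open("VinylAndCD.txt", "w") as dst:
    for line in src:
        record = json.loads(line)
        if all(k in record for k in KEEP):  # drop incomplete records
            dst.write(json.dumps({k: record[k] for k in KEEP}) + "\n")
```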

These are the first 20 or so records you might expect in the file VinyAndCD.txt after cleaning the data with the command above:

(Screenshot: cleaned dataset records)

Data Processing in Spark:

The file P2PRecommendations.ipynb contains the data processing component, using Spark for product-to-product recommendations.

Two types of reading mechanisms are implemented here, one slightly more complicated than the other.

Method 1 (easy): Reading from the text files that were generated by script.py, giving the correct paths.

We used Google Colab for this project, which reads the data using the standard spark.read.json command with a path to the file.

(Screenshot: schema definition example)

Make sure to give the correct path to your cleaned data using the command df = spark.read.json("").
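In place of the screenshot, here is a hedged sketch of reading the cleaned file with an explicit schema; the field list mirrors the review structure described earlier and may not match the notebook's exact schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (DoubleType, LongType, StringType,
                               StructField, StructType)

spark = SparkSession.builder.appName("p2p-recommendations").getOrCreate()

# An explicit schema avoids a second pass over the data for inference.
schema = StructType([
    StructField("reviewerID", StringType()),
    StructField("asin", StringType()),
    StructField("overall", DoubleType()),
    StructField("unixReviewTime", LongType()),
])

df = spark.read.json("VinylAndCD.txt", schema=schema)
df.printSchema()
```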

Method 2 (complicated): This may cause connection timeout errors if you are using Colab, since it can reset the runtime when large amounts of data are requested over the network.

Connecting to MongoDB

Components: uri is the Mongo connection string that uniquely identifies your database. Add your own connection string if you want to run the code against MongoDB; otherwise it won't work. This field is deliberately hidden because of database security and privacy concerns.

The spark.read.format("mongo") part of the script assumes that you are running your pyspark-mongo shell from the terminal on which you host the notebook. (Read the documentation to set that up before initializing the notebook.)

Colab requires separate configuration based on the access it has been given.
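A hedged sketch of this read path via the MongoDB Spark connector is shown below; the connector version and URI are placeholders to substitute with your own setup.

```python
from pyspark.sql import SparkSession

# Assumes the session was started with the connector on the classpath, e.g.:
#   pyspark --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1
spark = SparkSession.builder.appName("p2p-recommendations").getOrCreate()

# Placeholder URI; replace <usr> and <pass> with your own credentials.
uri = "mongodb+srv://<usr>:<pass>@project.fq5sh6t.mongodb.net/reviews.VinylAndCD"

df = (spark.read.format("mongo")
      .option("uri", uri)
      .load())
df.printSchema()
```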

Query processing: Run each cell to execute the query processing steps that generate the product-to-product recommendations.
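As one concrete (and hedged) example of a product-to-product signal, the query below counts, for each pair of products, how many users reviewed both; df is the reviews DataFrame loaded by either method above, and the notebook's actual queries may differ.

```python
# Items frequently reviewed by the same users are candidate "related" items.
df.createOrReplaceTempView("reviews")

co_reviewed = spark.sql("""
    SELECT a.asin   AS item,
           b.asin   AS related_item,
           COUNT(*) AS shared_reviewers
    FROM reviews a
    JOIN reviews b
      ON a.reviewerID = b.reviewerID
     AND a.asin < b.asin
    GROUP BY a.asin, b.asin
    ORDER BY shared_reviewers DESC
""")
co_reviewed.show(20)
```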
